Abstract

This report describes two novel approaches to pose invariant object representation and
recognition. The first section describes an efficient approach to pose invariant pictorial object
recognition employing spectral signatures of image patches that correspond to object surfaces which
are roughly planar. Complete affine invariance of the signatures is achieved by a log-log
sampling configuration in the frequency domain. Based on Singular Value Decomposition (SVD),
the affine transform is decomposed into slant, tilt, swing, scale and 2D translation. Unlike
previous log-polar representations, which were not invariant to slant (i.e. foreshortening only in
one direction), our new configuration yields complete affine invariance. The proposed log-log
configuration can be employed either globally or locally by Fourier or Gabor transforms. A novel
model based affine invariant segmentation scheme makes it possible to isolate and recognize several
objects in cluttered images. The actual signature recognition and 3D pose estimation are performed by
multi-dimensional indexing in a pictorial dataset represented in the frequency domain.
Experimental results with a dataset of 26 models show 100% recognition rates in a wide range of 3D
pose parameters and imaging degradations: 0-360° swing and tilt, 0-82° of slant (more than
1:7 foreshortening), more than 3 octaves in scale change, window-limited translation, high noise
levels (0 dB) and significantly reduced resolution (1:5).

In the second section, a novel method for representing 3-D objects that unifies viewer and
model centered object representations is presented. A unified 3-D frequency-domain
representation (called Volumetric/Iconic Spectral Signatures - V/ISS) encapsulates both the spatial
structure of the object and a continuum of its views in the same data structure. The frequency-domain
image of an object viewed from any direction can be directly extracted employing an extension
of the Projection Slice Theorem, where each Fourier-transformed view is a planar slice of the
volumetric frequency representation. The V/ISS representation is employed for pose-invariant
recognition of complex objects such as faces. The recognition and pose estimation are based on
an efficient matching algorithm in a four dimensional Fourier space. Experimental examples of
pose estimation and recognition of faces are also presented.


J. Ben-Arie




This section describes a method for pose invariant pictorial recognition of 3D objects employing
frequency domain techniques. By the term pictorial recognition we mean that the recognition
is achieved by matching, in feature space, a given image to a model dataset which consists of
various object pictures. Even though such a pictorial model dataset contains only a few aspects
for each 3D object represented, it is still possible to achieve robust recognition of objects in a
wide range of viewing directions and distances - if one employs pose invariant matching methods
as illustrated in this section. Hence, the number of pictures required to represent an object in
the database could become quite small.

If we treat pixel values as real numbers, we can regard each picture of an object instance as a
point in $R^M$, where $M$ is the number of pixels in the picture. As the parameters of the object's
pose vary, the point in the $M$-dimensional space traces out an $l$-dimensional manifold, where
$l$ is the number of pose parameters. Additional parameters that relate to illumination and sensor
characteristics may further increase the dimensionality $l$. Different objects generate different
manifolds in $R^M$. In this setting, object recognition can be posed as finding the closest point of
any of the manifolds to a given test image. If the point is close enough to one of the manifolds,
we can claim that the given test image belongs to the particular object whose manifold was the
one matched.
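The manifold formulation above can be illustrated with a minimal sketch, assuming each manifold is approximated by a finite set of sampled views. The function name `nearest_manifold`, the toy data and the acceptance threshold are hypothetical and are not part of the method described in this report.

```python
import numpy as np

def nearest_manifold(test_image, manifolds, threshold):
    """Return (label, distance) of the closest sampled manifold.

    manifolds maps an object label to an array of shape (num_samples, M),
    one row per sampled pose. The label is None when no sample is closer
    than the (assumed) acceptance threshold.
    """
    x = np.asarray(test_image, dtype=float).ravel()
    best_label, best_dist = None, np.inf
    for label, samples in manifolds.items():
        # distance from the test point to the nearest sample of this manifold
        d = np.linalg.norm(samples - x, axis=1).min()
        if d < best_dist:
            best_label, best_dist = label, d
    return (best_label if best_dist <= threshold else None), best_dist

# Toy data: two "objects", 50 sampled poses each, M = 16 pixels.
rng = np.random.default_rng(0)
manifolds = {"A": rng.normal(0.0, 1.0, (50, 16)),
             "B": rng.normal(5.0, 1.0, (50, 16))}
probe = manifolds["B"][3] + 0.01          # slightly perturbed view of object B
label, dist = nearest_manifold(probe, manifolds, threshold=1.0)
```

The cost of this brute-force search grows with the number of sampled poses, which is the motivation for the invariant representation developed in the rest of the section.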

The problem with this approach is that such a setting requires construction of a large number
of very complex manifolds which correspond to changing views of objects as a function of their
pose in space. A drastic simplification is necessary to render such an approach practically
implementable. One feasible suggestion is to find a pictorial representation which is invariant
to the largest possible number of pose parameters. For each parameter eliminated, we can reduce the
dimensionality and simplify the overall representation.

Many approaches have been suggested in the area of invariant pictorial representation and
recognition. Best fitting to the real problem are methods employing perspective projection
invariants. Such is the work of Jacobson and Wechsler [41], who employed a 4D Wigner distribution
combined with back projection to achieve perspective invariance in a 6 dimensional search
space. Since perspective invariance leads to unmanageable complexity (of 4D correlation in a 6D
search space), it is advantageous to approximate the perspective transformation by simpler ones
such as the affine transformation. Although the affine transformation is only an approximation of
the perspective transformation, it reflects quite accurately the real 3D geometric distortions of a
planar object when the dimensions of the object are relatively small compared with the distance
between the imaging system and the object itself. Several previous works suggested affine
invariant recognition of planar objects based on invariant moments [20] [19] and contours [39] [8].
In real imagery, both types of methods require accurate segmentation and edge grouping and
therefore they are quite sensitive to illumination, noise, clutter, partial occlusion and perspective
geometrical distortions. On the other hand, our approach, which is based on representing the
pictorial dataset in the frequency domain, has several advantages. First, it eliminates planar
translation effects in the imaging plane by considering only the magnitude of the Fourier (or
Gabor) transform. Second, noise and clutter represented in the frequency domain can be
easily filtered and removed. And third, as demonstrated in Section 1.4.2, the frequency based
signatures are quite tolerant to distortions that arise from inaccurate segmentation, multiplicative
illumination effects and the actual perspective imaging.
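The first advantage can be checked numerically. The sketch below assumes a circular shift (`np.roll`), for which the Fourier magnitude is exactly unchanged; for a windowed patch in a real image the invariance holds only approximately. The random test image is a stand-in, not data from the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((32, 32))

# Circularly translate the image by 5 rows and -3 columns.
shifted = np.roll(image, shift=(5, -3), axis=(0, 1))

# Translation changes only the phase of the spectrum, not its magnitude.
mag = np.abs(np.fft.fft2(image))
mag_shifted = np.abs(np.fft.fft2(shifted))
max_diff = np.max(np.abs(mag - mag_shifted))   # numerically negligible
```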

Once the planar translation effects are removed from the representation, the next task is to
achieve invariance to the other pose parameters, i.e. the three rotational degrees of freedom and
the remaining translation parameter, i.e. translation normal to the imaging plane (translation
along the optical axis). In Section 1.2 and in [7] [6] [5] [1] [3], we show that if we limit ourselves
to pose-invariant recognition of planar objects and surfaces, the above parameters can be
represented by slant, tilt, swing and scale parameters*. By the term slant, we refer to the angle
between the normals to the image plane and the object-plane. The tilt is defined as the angle
between the X-axis in the image plane and the axis of intersection of the object-plane with the
image plane (tilt axis). In an orthographic projection of planar shapes, slant causes
foreshortening only along the normal to the tilt axis in the image plane, while distances along the tilt
axis remain unaltered. Previous approaches developed for pictorial recognition which are based
on log-polar representations in the frequency domain [14] [11] [43] [22] or in the spatial domain
[40] are not invariant to the uneven distortion caused by foreshortening. The log-polar configuration
is invariant only to scale and rotation, i.e. the similarity transformation. However, the similarity
transform is only a subset of the complete affine transform and cannot represent all the geometric
distortions caused by orthographic projection.

In this section, an affine-invariant representation is achieved by sampling the frequency
domain representation in a novel configuration which is logarithmic in two orthogonal axes, i.e.
a log-log configuration. As elaborated in Section 1.2, the log-log configuration is invariant to
translation, slant and scale. Invariance to the remaining degrees of freedom, i.e. to tilt and swing
(rotation around the optical axis), is attained by a union of swung log-log configurations. As
described in Section 1.2 and in [7] [6] [5], it is feasible to derive the spectral signatures by several
methods that include the short-term Fourier transform, the Gabor transform and also two dimensional
Gaussian derivatives. All these methods are intended to obtain a spatially local representation of
image patches in the frequency domain. Local representations make it possible to recognize
several image patches in the same image independently. Hence, an object which is composed of several roughly
planar surfaces can be robustly recognized by recognizing a few of its surfaces or parts.

We choose to use the Gabor kernels since Gabor functions yield the smallest conjoint
space-bandwidth product permitted by the uncertainty principle of Fourier analysis [17] [41]. This
allows us to derive local frequency characteristics of image patches since Gabor kernels form a
complete basis for signal representation. Since the local Gabor signature obtained is still sensitive
to the location of the centers, we develop in Section 1.4.2 a model based affine invariant segmentation
method. This approach makes it possible to segment image regions with a predetermined shape (rectangular,
circular etc.) under any affine distortion. The segmentation method is based on image convolution
with a set of basis functions derived by the Karhunen-Loeve (K-L) transform. To achieve more
accurate segmentation, an additional stage of flexible matching is also included. The signature

*In this section, we use the terms “slant” and “tilt” to denote plane rotations in orthographic projection. To
avoid confusion, we comment that these terms are usually employed in the context of perspective projections.


Figure  2:  Invariance  of  signatures  to  swing  and  tilt  of  the  shape.  For  the  displayed  swing  and  tilt  of 
the  airplane  shape,  the  oriented  kernel  pairs  A-A’,  B-B’,  C-C’,  and  D-D’  yield  invariant  signatures. 
Note  that  since  the  kernels  have  symmetry  properties,  it  is  required  to  implement  only  one  quadrant 
of  kernels  spanning  90  degrees  of  orientation  to  achieve  360  degrees  of  swing  and  tilt  invariance. 

of  each  segmented  region  is  then  derived  independently  and  objects  can  be  recognized  even  in 
cluttered  scenes  as  demonstrated  in  Section  1.4.2. 

In  Section  1.2  we  provide  a  mathematical  description  of  the  affine  invariant  representation 
both  in  the  spatial  and  frequency  domains.  In  Section  1.3  we  describe  the  recognition  techniques 
and  in  Section  1.4  we  illustrate  the  experiments  which  achieve  quite  a  robust  recognition  in  a 
wide  range  of  viewing  conditions. 

1.2  Affine-Invariant  Spectral  Signatures  (AISSs) 

Our  overall  approach  is  based  on  pictorial  recognition  of  image  patches  that  correspond  to  object 
surfaces  that  are  approximately  planar.  As  elaborated  later,  object  surfaces  can  be  recognized 
in  a  general  3D  pose.  The  class  of  objects  that  can  be  recognized  is  not  limited  to  convex 
objects  and  also  includes  concave  objects  or  objects  with  holes,  etc.  As  long  as  an  object  has 
at  least  one  approximately  planar  surface  with  distinctive  features,  it  may  be  recognized  by  this 
approach.  As  experimental  results  demonstrate  in  Section  1.4,  many  non-planar  objects  which 
have  approximately  flat  shapes  such  as  hands,  airplanes,  etc.  are  robustly  recognized  with  our 
approach  as  well. 

We  use  the  affine  transformation  to  simulate  transformation  of  a  planar  shape  that  undergoes 
3D  rotation  and  3D  translation,  and  is  then  orthographically  projected  onto  the  image  plane  and 
scaled (reduced or increased in size). A point $X = (x, y)^T$ in the coordinate system of the shape
is affine-transformed to a point $X_a = (x_a, y_a)^T$ in the imaging plane's coordinate system according
to the following formula:

$$\begin{pmatrix} x_a \\ y_a \end{pmatrix} = C \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} g \\ h \end{pmatrix} = \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} g \\ h \end{pmatrix} \tag{1}$$

where matrix $C$ represents the tilt and slant operations and $(g, h)^T$ denotes translation. This general
formulation represents any orthographic projection plus scaling of planar shapes. Such a
projection approximates perspective projections quite accurately if the viewing distance of the shape is
relatively large with respect to the shape's dimensions. Based on Singular Value Decomposition
(SVD), the matrix $C$ can be decomposed as follows:

$$\begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix} = \begin{pmatrix} \cos(\phi) & \sin(\phi) \\ -\sin(\phi) & \cos(\phi) \end{pmatrix} \begin{pmatrix} \sqrt{\lambda_1} & 0 \\ 0 & \sqrt{\lambda_2} \end{pmatrix} \begin{pmatrix} \cos(\tau) & -\sin(\tau) \\ \sin(\tau) & \cos(\tau) \end{pmatrix} \tag{2}$$

where $\lambda_1$ and $\lambda_2$ are eigenvalues of $CC^T$, and $\phi$ and $\tau$ are angles related to the eigenvectors of $CC^T$
and $C^TC$. In practical situations, $C$ is usually a nonsingular matrix, so $\lambda_1$ and $\lambda_2$ have positive
values. If we arrange the eigenvalues so that $\lambda_1 \geq \lambda_2$, the eigenmatrix $\Lambda$ can be posed as


According to Eq. (1) and Eq. (2), any orthographic projection of points on a plane can be
represented by a sequence of transformations which include translation, tilt, slant, scale and
swing (rotation around the optical axis). To represent 3D rotation of a plane it is necessary to
use slant and tilt transformations in which the shape is posed on a plane which is slanted and
tilted with respect to the imaging plane. The slant angle is measured between the normals of the
imaging and shape planes. Tilt is defined as the angle between the X-axis in the imaging
plane and the line $L$ created by the intersection of the imaging and shape planes (the tilt axis $L$).
Here, this angle is defined as $\tau$ in Eq. (2). $\phi$ in Eq. (2) represents shape swing within the imaging
plane. The slant angle $\sigma$ corresponds to shape foreshortening in the imaging plane along the
axis normal to the line $L$. The foreshortening ratio is equal to $\cos(\sigma) = \sqrt{\lambda_2/\lambda_1}$. In contrast
to slanting, scaling causes uniform foreshortening (or enlargement) in the imaging plane in all
directions. The scale factor is equal to $\lambda_1$ in Eq. (3). The above parameters of translation, swing,
scale, slant and tilt completely represent the scaled orthographic projection.
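The decomposition of Eq. (2) can be sketched numerically with `numpy.linalg.svd`. This is an illustration under stated assumptions rather than the report's implementation: it assumes $\det(C) > 0$ (no reflection), the function names `decompose_affine` and `compose_affine` are hypothetical, and the recovered angles are unique only up to a simultaneous rotation of both factors by 180 degrees. The singular values returned in `s` are $\sqrt{\lambda_1}$ and $\sqrt{\lambda_2}$.

```python
import numpy as np

def decompose_affine(C):
    """Split a 2x2 matrix with det(C) > 0 into the three factors of Eq. (2)."""
    U, s, Vt = np.linalg.svd(C)            # C = U @ diag(s) @ Vt, s[0] >= s[1]
    if np.linalg.det(U) < 0:               # make both orthogonal factors pure rotations
        U[:, 1] *= -1
        Vt[1, :] *= -1
    phi = np.arctan2(U[0, 1], U[0, 0])     # left factor  [[cos, sin], [-sin, cos]]
    tau = np.arctan2(Vt[1, 0], Vt[0, 0])   # right factor [[cos, -sin], [sin, cos]]
    # foreshortening ratio cos(slant) = sqrt(lambda_2 / lambda_1)
    slant = np.arccos(np.clip(s[1] / s[0], -1.0, 1.0))
    return phi, tau, s, slant

def compose_affine(phi, tau, s):
    """Rebuild C from swing phi, tilt tau and singular values s, as in Eq. (2)."""
    A = np.array([[np.cos(phi), np.sin(phi)], [-np.sin(phi), np.cos(phi)]])
    B = np.array([[np.cos(tau), -np.sin(tau)], [np.sin(tau), np.cos(tau)]])
    return A @ np.diag(s) @ B

C = compose_affine(0.3, -0.5, np.array([2.0, 1.0]))
phi, tau, s, slant = decompose_affine(C)
rebuilt = compose_affine(phi, tau, s)
slant_deg = np.degrees(slant)              # foreshortening 1:2 corresponds to 60 degrees
```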

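When a planar object undergoes affine transformation, the frequency spectrum of its image
is also transformed by a similar set of transformations. Given a function $f(X)$ with Fourier
Transform $F(u, v)$, its affine transformed version $f_a(X_a) = f(C^{-1}(X_a - (g, h)^T))$ has the
frequency spectrum as follows:

$$|F_a(u_a, v_a)| = |C|\,|F(u, v)|$$

$$\begin{pmatrix} u_a \\ v_a \end{pmatrix} = \begin{pmatrix} \cos(\phi) & \sin(\phi) \\ -\sin(\phi) & \cos(\phi) \end{pmatrix} \begin{pmatrix} \lambda_1^{-1/2} & 0 \\ 0 & \lambda_2^{-1/2} \end{pmatrix} \begin{pmatrix} \cos(\tau) & -\sin(\tau) \\ \sin(\tau) & \cos(\tau) \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} \tag{4}$$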

Thus,  the  effect  of  affine  transformation  on  the  spectrum  is  almost  the  same  as  that  of  the  affine 
transform  on  the  object  in  spatial  domain  except  for  two  major  differences:  First,  the  spectrum 
is  inversely  scaled  and  slanted.  Secondly,  shape  translations  parallel  to  the  image  plane  do  not 
affect  the  spectrum. 

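The coordinate transformation in Eq. (4) can be rewritten as

$$\begin{pmatrix} \cos(\phi) & -\sin(\phi) \\ \sin(\phi) & \cos(\phi) \end{pmatrix} \begin{pmatrix} u_a \\ v_a \end{pmatrix} = \begin{pmatrix} \lambda_1^{-1/2} & 0 \\ 0 & \lambda_2^{-1/2} \end{pmatrix} \begin{pmatrix} \cos(\tau) & -\sin(\tau) \\ \sin(\tau) & \cos(\tau) \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} \tag{5}$$

From Eq. (5), sampling the affine transformed shape's spectrum $|F_a(u_a, v_a)|$ along two
orthogonal directions at angles $\phi$ and $\phi + \pi/2$ results in a spectral representation we call the spectral
signature $N_a(\omega_1, \omega_2, \phi)$, where $\omega_1 = \log_r(|u_a\cos(\phi) - v_a\sin(\phi)|)$ and $\omega_2 = \log_r(|u_a\sin(\phi) + v_a\cos(\phi)|)$.

Sampling the original shape's spectrum $|F(u, v)|$ along two orthogonal directions at angles $\tau$ and
$\tau + \pi/2$ results in the model's spectral signature $N(\omega_1, \omega_2, \tau)$, where $\omega_1 = \log_r(|u\cos(\tau) - v\sin(\tau)|)$
and $\omega_2 = \log_r(|u\sin(\tau) + v\cos(\tau)|)$. The two spectral signatures thus derived are related as
$N_a(\omega_1, \omega_2, \phi) = |C|\,N(\omega_1 - \alpha_1, \omega_2 - \alpha_2, \tau)$, where $\alpha_1 = \log_r(\sqrt{\lambda_1^{-1}})$ and $\alpha_2 = \log_r(\sqrt{\lambda_2^{-1}})$. We note
that the signature is not altered by slanting and scaling but is only translated in the $(\omega_1, \omega_2)$
plane (see Fig. 4 and Fig. 5).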
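The shift property can be illustrated with a small numerical sketch. It assumes the swing and tilt are already aligned ($\phi = \tau = 0$), so that only the diagonal factor of Eq. (5) acts, and it uses a made-up analytic magnitude spectrum `model_spectrum`; the log base `gamma` and the singular values are chosen as integer powers of `gamma` so that the shift is a whole number of samples.

```python
import numpy as np

gamma = 1.25                       # log base r of the sampling grid (assumed value)
idx = np.arange(-8, 9)             # integer sample positions in omega_1 and omega_2

def model_spectrum(u, v):
    """Hypothetical smooth magnitude spectrum |F(u, v)| for positive u, v."""
    return (np.exp(-0.5 * (np.log(u) - 1.0) ** 2 - 0.5 * (np.log(v) - 0.5) ** 2)
            + 0.3 * np.exp(-(u - 2.0) ** 2 - (v - 3.0) ** 2))

def signature(spectrum):
    """Sample a spectrum on the log-log grid u = gamma**i, v = gamma**j."""
    u = gamma ** idx[:, None]
    v = gamma ** idx[None, :]
    return spectrum(u, v)

s1, s2 = gamma ** 2, gamma ** 1    # sqrt(lambda_1), sqrt(lambda_2)
det_c = s1 * s2                    # |C| for the diagonal case

def transformed_spectrum(ua, va):
    # Eq. (4) with phi = tau = 0: u = sqrt(lambda_1) * ua, v = sqrt(lambda_2) * va
    return det_c * model_spectrum(s1 * ua, s2 * va)

N = signature(model_spectrum)
Na = signature(transformed_spectrum)
# Na[i, j] equals det_c * N[i + 2, j + 1]: a pure translation plus the scalar |C|.
shift_holds = np.allclose(Na[:-2, :-1], det_c * N[2:, 1:])
```

In this toy case $\alpha_1 = -2$ and $\alpha_2 = -1$ in units of grid samples, which matches the relation $N_a(\omega_1, \omega_2) = |C|\,N(\omega_1 - \alpha_1, \omega_2 - \alpha_2)$.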

It is noted here that a 2D Cartesian version of the Mellin transform - implemented here in
the frequency domain - which is defined as

$$M_a(\xi_1, \xi_2, \phi) = \iint |F_a(u_a\cos(\phi) - v_a\sin(\phi),\; u_a\sin(\phi) + v_a\cos(\phi))|\,(u_a\cos(\phi) - v_a\sin(\phi))^{j\xi_1 - 1}\,(u_a\sin(\phi) + v_a\cos(\phi))^{j\xi_2 - 1}\,du_a\,dv_a \tag{6}$$

also achieves invariance to slanting and scaling, which result in linear phase shifts proportional
to $\ln(\sqrt{\lambda_1})$ and $\ln(\sqrt{\lambda_2})$.

$M_a(\xi_1, \xi_2, \phi)$ is the Mellin transform of $|F_a(u_a, v_a)|$ with axis direction at $\phi$, and $M(\xi_1, \xi_2, \tau)$ is the
Mellin transform of the original spectrum $|F(u, v)|$ with axis direction at $\tau$.
The estimation of slant and scale parameters from the relative phase shift between $M_a(\xi_1, \xi_2, \phi)$
and $M(\xi_1, \xi_2, \tau)$ is quite difficult. On the other hand, these parameters are easier to estimate from
the relative shift between our spectral signatures $N(\omega_1, \omega_2, \tau)$ and $N_a(\omega_1, \omega_2, \phi)$ in the $(\omega_1, \omega_2)$ plane.
Hence, the signature $N_a(\omega_1, \omega_2, \phi)$ from the affine transformed object is a shifted version of the
signature $N(\omega_1, \omega_2, \tau)$ from the object itself except for a scalar $|C|$. The shift in the 2D signature
plane $(\omega_1, \omega_2)$ directly depends on the slant and the scale included in the affine transformation.
In order to account for the remaining two rotational degrees of freedom, i.e. swing and tilt, we
generate for the affine transformed object a set of signatures $\{N_a(\omega_1, \omega_2, \theta_1) : 0° \leq \theta_1 < 360°\}$
which have equally spaced orientations and which span the range of 360 degrees. A set of
signatures $\{N(\omega_1, \omega_2, \theta_2) : 0° \leq \theta_2 < 360°\}$ for the model is also created in the same way. Among the
set of pictorial signatures generated, there exists one which matches the signature of the model
object except for a translation in the $(\omega_1, \omega_2)$ plane - which represents scale and slant differences.

Figure  1  displays  a  block  diagram  of  the  overall  system.  The  image  is  correlated  with  a 
set  of  Gabor  kernels.  The  frequencies  of  the  kernels  are  derived  from  a  logarithmic  sampling 
according  to  Eq.  (8)  and  Eq.  (9).  This  set  is  centered  at  various  ‘interest  locations’  which 
correspond  to  approximate  centers  of  prominent  image  patches^.  A  set  of  spectral  signatures 
is  then  generated.  Each  signature  represents  a  local  image  patch.  These  signatures  are  then 
independently  recognized  using  Multidimensional  Indexing  (MDI).  The  3D  pose  (slant,  tilt,  swing 
and  scale)  of  each  recognized  patch  is  also  obtained  as  a  by-product. 


The  affine-invariant  representation  presented  in  this  section  is  based  on  a  set  of  elliptical  2D 
Gabor  kernels  defined  as 


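$$\begin{pmatrix} x_l \\ y_l \end{pmatrix} = \begin{pmatrix} \cos\theta_l & \sin\theta_l \\ -\sin\theta_l & \cos\theta_l \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} \tag{8}$$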


where $f_x$, $f_y$ are frequency coefficients, $f_x, f_y = 1 \ldots N_f$. The standard deviations $\sigma_{x_m}$ and $\sigma_{y_n}$ of
these elliptical kernels vary in a geometrical progression with the indices $m$ and $n$ as

$$\sigma_{x_m} = \sigma_0\gamma^m, \qquad \sigma_{y_n} = \sigma_0\gamma^n, \qquad m, n = 1 \ldots N_\sigma \tag{9}$$

where the geometric ratio $\gamma > 1$ and the smallest standard deviation $\sigma_0$ are constants. The
indices $m$ and $n$ define a signature space $(m, n)$ and also determine the sampling points in the
$(\omega_1, \omega_2)$ plane for a given set of $\sigma_0$, $f_x$ and $f_y$. In addition, the Gabor in Eq. (8) is modulated
in two orthogonal axes (which have orientation $\theta_l$, denoted by $x_l$ and $y_l$) by a complex sinusoid

^As described in Section 1.4.2, these patches are first segmented from the image and the signature obtained is
thus not sensitive to the exact locations of the center nor to neighboring image regions.
Figure 3: Partial subset of kernels $K^{\theta_l}$ ($f_x = 1$ and $f_y = 1$) for orientation $\theta_l = 0$ degrees in the spatial
domain (left) and the configuration of the corresponding kernels in the frequency domain (right). Each subset
completely spans the frequency domain.


with periods proportional to the deviations of the corresponding Gaussian profile. The parameter $\theta_l$
denotes the orientation of the kernel and uniformly spans the range [0, 360) degrees in discrete
steps $l = 1 \ldots N_\theta$.

The above scheme generates a subset of modulated Gaussian kernels (indexed by
$m, n = 1 \ldots N_\sigma$) with identical orientation $\theta_l$ and identical frequency coefficients (denoted by
$f_x$ and $f_y$), but with varying aspect ratio and size (indexed by $\sigma_{x_m}$ and $\sigma_{y_n}$). For each orientation
$\theta_l$, we have a cumulative subset $K^{\theta_l}$ of kernels which includes all the frequency coefficients, i.e.,
$f_x, f_y = 1 \ldots N_f$. The complete set of kernels $K$ consists of the union of all the subsets $K^{\theta_l}$ swung
to different orientations $\theta_l$, $l = 1 \ldots N_\theta$, that uniformly span 360 degrees. In practice it is required
only to generate kernels that span one quadrant (90 degrees) of orientation. All the other kernels
can be constructed from this reduced set using symmetry properties. An example of one subset
of kernels $K^{\theta_l}$ with $\theta_l = 0$ degrees is illustrated in Fig. 3 (left, only the real parts of the kernels).
The frequency spectrum of this subset of kernels is also illustrated in Fig. 3 (right), and shows
that each subset of kernels completely spans the band-limited frequency domain of interest
and is logarithmically spaced as needed.
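A kernel of this family can be sketched as follows. The exact kernel formula is not reproduced here, so the code is only an assumed form: an elliptical Gaussian envelope in the rotated coordinates $(x_l, y_l)$, multiplied by a complex sinusoid whose period along each axis is proportional to the corresponding standard deviation. The function name `gabor_kernel`, the proportionality constant, and the values of `sigma0` and `gamma` are hypothetical.

```python
import numpy as np

def gabor_kernel(size, sigma_x, sigma_y, theta, fx=1, fy=1):
    """Assumed elliptical Gabor kernel on a size x size grid (size odd)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # rotated coordinates (x_l, y_l) for orientation theta
    xl = x * np.cos(theta) + y * np.sin(theta)
    yl = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-0.5 * ((xl / sigma_x) ** 2 + (yl / sigma_y) ** 2))
    # complex sinusoid with periods proportional to sigma_x and sigma_y
    carrier = np.exp(1j * (fx * xl / sigma_x + fy * yl / sigma_y))
    return envelope * carrier

# One subset of kernels for a single orientation: geometric progression of sigmas.
sigma0, gamma = 2.0, 1.5                       # assumed constants
sigmas = sigma0 * gamma ** np.arange(1, 4)     # sigma_0 * gamma**m for m = 1, 2, 3
bank = np.array([[gabor_kernel(33, sx, sy, theta=0.3) for sy in sigmas]
                 for sx in sigmas])            # shape (3, 3, 33, 33)

# Symmetry: rotating the kernel by 180 degrees gives its complex conjugate,
# so its correlation magnitude with a real image is unchanged.
k = gabor_kernel(33, 4.0, 8.0, 0.3)
k_rot = gabor_kernel(33, 4.0, 8.0, 0.3 + np.pi)
```

The conjugate symmetry shown at the end is one example of the symmetry properties that allow a reduced set of orientations to be generated explicitly.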

When a local image patch $I(x, y)$ is correlated with this configuration of kernels, it generates
a set of multi-dimensional spectral signatures $\{S_{f_x,f_y,\theta_l} ;\ f_x, f_y = 1 \ldots N_f,\ l = 1 \ldots N_\theta\}$ composed of
the correlation (projection) coefficients of all the kernels. Explicitly,

$$S_{f_x,f_y,\theta_l}(m, n) = |\langle K^{m,n}_{f_x,f_y,\theta_l}(x, y),\, I(x, y) \rangle| \tag{10}$$

for $m, n = 1 \ldots N_\sigma$, where $|\cdot|$ denotes the modulus of a complex number.
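Eq. (10) amounts to taking the modulus of an inner product between the patch and each kernel. A minimal sketch, assuming the kernels of one orientation and one frequency pair are stacked in a complex array of shape `(n_m, n_n, H, W)`; the function name `spectral_signature` and the random stand-in data are hypothetical.

```python
import numpy as np

def spectral_signature(patch, kernels):
    """Signature plane S[m, n] = |<K_mn, I>| for one orientation and frequency pair.

    patch   : real array of shape (H, W)
    kernels : complex array of shape (n_m, n_n, H, W)
    """
    # inner product <K, I> = sum over pixels of conj(K) * I, one value per (m, n)
    coeffs = np.einsum("mnhw,hw->mn", np.conj(kernels), patch)
    return np.abs(coeffs)

rng = np.random.default_rng(3)
kernels = rng.normal(size=(3, 4, 9, 9)) + 1j * rng.normal(size=(3, 4, 9, 9))
patch = rng.random((9, 9))

S = spectral_signature(patch, kernels)
S_bright = spectral_signature(2.0 * patch, kernels)   # multiplicative illumination change
```

Because only the modulus is kept, a multiplicative change of image intensity scales the whole signature by the same factor, which is why amplitude ratios are used later for indexing.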

The property of slant-invariance arises from the fact that when the image patch corresponds
to a slanted shape, say in axis $x_l$, its signature shifts in the direction of the slant, i.e.
along the corresponding axis of the $(\omega_1, \omega_2)$ plane, with respect to the signature of the unslanted shape. Also, when the shape is scaled, all the
signatures $\{S_{f_x,f_y,\theta_l} ;\ l = 1 \ldots N_\theta\}$ shift equally, i.e. diagonally, in the $(\omega_1, \omega_2)$ plane. Hence, any
combined slant and scale results in a corresponding shift in the $(\omega_1, \omega_2)$ plane. Fig. 4 illustrates
these properties of the signature. Fig. 5 displays contour plots of the signature of the airplane
model and the corresponding signature when the airplane is slanted by 60 degrees with a tilt of 15
degrees. It is easily observed (see the labels A and B on the plots displayed for easy registration)
that the signature does not change except for a translation in the $(\omega_1, \omega_2)$ plane. The translation
between a model signature and the image signature can be used to compute the relative 3D pose
between the two. The difference in $\omega_1$ and $\omega_2$ can be directly translated into relative slant and
scale. The other angular pose parameters of tilt and swing can also be retrieved as described
below. The X and Y coordinates are derived from the image, and the depth parameter can be
derived from the scale.
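The conversion from a signature shift to slant can be sketched as below. This is an illustration only: the brute-force search `estimate_shift`, the toy signature, the log base `gamma`, and the sign convention of the shift are all assumptions, and the actual system obtains the shift through indexing (Section 1.3) rather than exhaustive search.

```python
import numpy as np

def toy_signature(i, j):
    """Hypothetical smooth signature on the (omega_1, omega_2) sample grid."""
    return (np.exp(-((i - 8.0) ** 2 + (j - 6.0) ** 2) / 8.0)
            + 0.5 * np.exp(-((i - 3.0) ** 2 + (j - 12.0) ** 2) / 4.0))

def estimate_shift(model, image, max_shift):
    """Find (d1, d2) such that image[i, j] is proportional to model[i + d1, j + d2]."""
    n1, n2 = model.shape
    best, best_err = (0, 0), np.inf
    for d1 in range(-max_shift, max_shift + 1):
        for d2 in range(-max_shift, max_shift + 1):
            i0, i1 = max(0, -d1), min(n1, n1 - d1)
            j0, j1 = max(0, -d2), min(n2, n2 - d2)
            a = image[i0:i1, j0:j1]
            b = model[i0 + d1:i1 + d1, j0 + d2:j1 + d2]
            # normalise both windows so the scalar |C| does not matter
            err = np.linalg.norm(a / np.linalg.norm(a) - b / np.linalg.norm(b))
            if err < best_err:
                best, best_err = (d1, d2), err
    return best

i, j = np.meshgrid(np.arange(20), np.arange(20), indexing="ij")
model = toy_signature(i, j)
image = 3.0 * toy_signature(i + 2, j + 1)      # shifted by (2, 1) samples, scaled by 3

gamma = 1.25                                   # assumed log base of the sampling grid
d1, d2 = estimate_shift(model, image, max_shift=4)
sqrt_l1, sqrt_l2 = gamma ** d1, gamma ** d2    # singular values implied by the shift
slant_deg = np.degrees(np.arccos(sqrt_l2 / sqrt_l1))   # cos(slant) = sqrt(l2 / l1)
```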

Since  shapes  can  be  slanted  and  tilted  in  any  orientation  in  space,  one  has  to  generate  a  subset 
of  kernels  for  each  tilt  direction  and  for  each  orientation,  which  forms  two  rotational  degrees  of 
freedom.  These  two  degrees  of  freedom  are  dealt  with  by  using  the  complete  set  of  kernels  K 
both  for  the  model  signature  and  for  the  image  signature.  This  is  demonstrated  in  Fig.  2,  where 
it  is  shown  that  even  if  the  model  is  tilted  and  swung,  there  is  exact  correspondence  between 
four  of  the  model  signatures  (marked  by  labels  A  through  D)  and  four  of  the  image  signatures 
(marked  by  labels  A’  through  D’).  This  invariance  to  swing  and  tilt  is  possible  only  because 
both  the  model  and  the  image  are  processed  by  subsets  of  kernels  at  different  orientations.  In 
Section 1.4, it is experimentally found that sampling of 7.5 degrees in $\theta_l$ achieves a sufficient
interpolation to accommodate any intermediate values of tilt and swing.

From  Eq.  (10),  we  see  that  the  signatures  are  related  only  to  the  magnitudes  of  the  complex 
correlation  coefficients,  the  phase  information  being  completely  eliminated.  Thus,  the  signatures 
obtained are - to a large extent - invariant to limited translation of the object within the localized
Gabor support (approximately $\pm\sigma$).

Hence, the combined set of kernels $K$, composed of all the subsets $K^{\theta_l}$, sufficiently covers
scale, slant, tilt, swing, and translation, i.e. all affine transformation parameters that simulate
the scaled orthographic projection.
1.3 Affine-Invariant Recognition by Multi-Dimensional Indexing

Our recognition scheme is based on the affine-invariant nature of the spectral signatures described
in Section 1.2. As explained above, when the shape is slanted with a tilt axis of orientation $\theta_l$, the
signature $S_{f_x,f_y,\theta_l}$ - which corresponds to the tilt orientation $\theta_l$ - undergoes simple shifts in the
$(\omega_1, \omega_2)$ plane that correspond to scale and slant transformations. The purpose of the indexing
scheme is to robustly identify the image patch from its set of signatures. Each signature
corresponds to a combination of orientation $\theta_l$ and the frequency coefficients $f_x$, $f_y$. A robust
recognition scheme is required since the signatures could be partially distorted due to illumination
variations, due to the discrete nature of the orientation or due to the limited range of scales.
Furthermore, irrelevant clutter in the receptive field and partial occlusion can result in additional
distortions.

In order to overcome these signature distortions, we implement a voting scheme using the
spectral signatures, based on MDI [13]. MDI basically relies on the same principles as the
geometric hashing method [28]. The main difference is that in geometric hashing the indices for the
hash table have few dimensions, which correspond to few invariant shape characteristics, whereas
MDI uses indices of much higher dimensionality. The low dimensionality of
geometric hashing causes overcrowding of bins, and the hash table sometimes saturates even with
a small number of objects. On the other hand, MDI improves the robustness of the recognition
(which is expressed as the ratio of the highest vote to the next highest vote). This result was
also observed by [34]. The innovation of our indexing scheme is that it is implemented in the
frequency domain using spectral signatures. Additional merits of MDI are that the retrieval
size of the database is considerably increased, the overcrowding of bins in the hash table is
almost eliminated, and coarser quantization can be used without reducing discrimination. We
experimentally found that the large dimensionality in the indexing space does not significantly
increase the search times.

In our indexing scheme, the hash table is updated by each model using all its signatures
$S_{f_x,f_y,\theta_l}$. 11-dimensional indices are generated for the models (to each index, an additional
nine-dimensional information vector is also attached). Every index corresponds to a pair of points
in the signature space $(n, m)$ with respect to three pairs of different relative frequencies $(f_x, f_y)$.
The indices are based on the following parameters: the offset of the second point with respect
to the first point (two dimensions), the directions of the gradients of the signature at these two
points (six dimensions), and the amplitude ratios of the signature values at these two points (three
dimensions). A hash table is used to store all the indices and the additional information vectors
of the models. The additional information vectors include elements such as the angle $\theta_l$ of the
kernels and the coordinates of the first point $(n, m)$ that are used for deriving pose information.
The relative pose, which corresponds to the relative shift of the first points of model indices with
respect to the first points of object indices in $n$ and $m$, and the angles of the kernels are derived
in the process of indexing as by-products. These numbers can subsequently be translated to the
relative slant, tilt, scale and swing between the image patch and the model.
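The construction of one 11-dimensional index can be sketched as follows. The grouping into 2 + 6 + 3 components follows the description above, but the quantization (number of angle bins, logarithmic ratio bins) and the function name `make_index` are assumptions made for illustration, not the report's parameters.

```python
import numpy as np
from collections import defaultdict

def make_index(sigs, p, q, angle_bins=16, ratio_bins=8):
    """Build an 11-component index from a pair of points p, q in signature space.

    sigs : positive array of shape (3, n, m), one signature plane per frequency pair.
    """
    offset = (q[0] - p[0], q[1] - p[1])                       # 2 dimensions
    gn, gm = np.gradient(sigs, axis=(1, 2))                   # per-plane gradients

    def direction(k, pt):
        ang = np.arctan2(gm[k][pt], gn[k][pt])                # gradient direction
        return int(np.floor((ang + np.pi) / (2 * np.pi) * angle_bins)) % angle_bins

    dirs = tuple(direction(k, pt) for k in range(3) for pt in (p, q))   # 6 dimensions

    def ratio_bin(k):
        r = np.log2(sigs[k][q] / sigs[k][p])                  # amplitude ratio (log scale)
        return int(np.clip(np.floor(r + ratio_bins / 2), 0, ratio_bins - 1))

    ratios = tuple(ratio_bin(k) for k in range(3))            # 3 dimensions
    return offset + dirs + ratios

rng = np.random.default_rng(2)
sigs = rng.random((3, 12, 12)) + 0.1
index = make_index(sigs, (3, 4), (6, 9))
index_scaled = make_index(2.0 * sigs, (3, 4), (6, 9))        # signature multiplied by a scalar

# Hash table entry: index -> list of (model name, kernel angle, first point).
table = defaultdict(list)
table[index].append(("model_A", 15.0, (3, 4)))
```

Every component of the index is a ratio, a direction or a difference, so multiplying the whole signature by a scalar leaves the index unchanged in this sketch.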

In  Eq.  (4),  we  see  that  the  affine  transform  introduces  a  pose  dependent  shift  of  the  signatures 
as well as a scalar $|C|$. Using the ratio of amplitudes as part of the multi-dimensional index
eliminates  the  effect  of  this  scalar  and  also  yields  invariance  to  multiplicative  variations  of  image 
intensity. 

1.4  Experimental  Results 

1.4.1  Recognition  of  Single  Patch  Isolated  Objects 

This section describes experimental results using the above-mentioned approach for affine-invariant
recognition. In these experiments, according to the notation of Eq. (9) in Section 1.2,
the kernels employ a set of standard deviations {8, ..., 24}, a set of relative frequency
coefficients $\{(f_x, f_y) = (1,1), (4,4), (7,7)\}$, and 24 orientations $\theta_i$ in steps of 7.5
degrees. For a given image patch $I(x,y)$, a set of spectral signatures is generated by
correlating it with the above kernels.

As elaborated in Section 1.2, these signatures are used along with an MDI scheme for affine-invariant
recognition. For each model to be included in the hash table, signatures are generated
using the kernels, and the set of 11-dimensional indices is computed. Each index is
included as an entry in the hash table along with the pose parameters of the model, represented
by n, m and $\theta_i$. Given an image patch to be recognized invariant to affine transformation, its
signatures and 11-dimensional indices are generated in an identical fashion. These indices are
then compared with indices in the hash table, and each matching index adds one vote for the
corresponding model pointed to by a pointer in that entry. In addition, pose information is
derived as described in Section 1.3. The total number of votes accumulated by each model (with
pose) over all the indices of the test image is the matching score for that model.
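
The voting step can be sketched as below. This is a minimal illustration under stated assumptions: indices are hashable tuples, the table layout and function names are hypothetical, and votes are counted per model only (the report accumulates votes per model together with its pose, which is omitted here for brevity).

```python
from collections import defaultdict

def build_hash_table(model_indices):
    """Map each index to the list of (model, pose_info) entries that produced it.

    model_indices : dict {model_name: iterable of (index, pose_info) pairs}.
    """
    table = defaultdict(list)
    for model, entries in model_indices.items():
        for index, pose_info in entries:
            table[index].append((model, pose_info))
    return table

def vote(table, test_indices):
    """Add one vote per matching index and return the score of each model."""
    scores = defaultdict(int)
    for index in test_indices:
        # Indices absent from the table simply contribute no votes.
        for model, _pose_info in table.get(index, ()):
            scores[model] += 1
    return dict(scores)
```

The model with the largest score is then taken as the recognized one, for example via `max(scores, key=scores.get)`.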

We  use  a  dataset  of  26  objects  (displayed  in  Fig.  6)  in  our  initial  experiments.  Since  the 
experiments  are  mainly  performed  to  test  the  pictorial  affine  invariant  recognition  scheme  in 
this  section,  every  object  in  the  dataset  is  considered  as  a  single  patch.  These  models  consist  of 
randomly  selected,  real  gray-level  images  (128  x  128)  of  objects  with  some  amount  of  texture  as 
well. A hash table is created using a single set of signatures from each model. Experiments are
performed under varied conditions of slant, tilt, scale and swing and yield close to 100% correct
recognition rates, as illustrated in Figs. 8, 9, 11 and 14. In addition, the pose of each model is
also estimated correctly in all experiments.

Figure 6: 26 model objects in the dataset. Close to 100% recognition is achieved over a wide range
of slant, tilt, scale, and swing. Note that many of the models (such as hands) are quite similar in
appearance and are still correctly classified.

Figure 7: Two test images that correspond to affine-transformed models (compare to airplane and
truck in Fig. 6).

Robust recognition is achieved over a range of more than 3 octaves of scaling, slant angles of
more than 80 degrees (foreshortening ratio of 1:7), and image swing and shape tilt of 360 degrees.
Two of the successfully recognized test images are displayed in Fig. 7.

Figure 8: Correct recognition rates for scaled objects. The experiments are performed with 26
models and all the test objects are scaled versions of the models at different scales from
1 octave to -3 octaves (i.e. from a scale factor of 2 down to 0.125).

In the recognition experiments with varied scale, as illustrated in Fig. 8, the method achieves a 100%
recognition rate even when images are downscaled by 2.5 octaves. It should be noted that the
images scaled halfway in between the decimation interval for our Gabor kernels are still correctly
recognized. The maximum error in pose estimation is 1.09 of the scale factor. In swing and
tilt experiments, the recognition rates are examined for the full 360° and are found to be 100%.
Due to the constant recognition rates, graphs are not presented for tilt and swing. In slant
experiments, the images are foreshortened only in one direction. The minimal foreshortening factor
is around 0.0743, which corresponds to a slant angle of 85.4 degrees. Fig. 9 shows the correct
recognition rates for different slant angles.
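
The two pose quantities used in these experiments can be related to more familiar numbers with a short sketch. It assumes the usual conventions that one octave is a factor of two in scale and that, under orthographic projection, the foreshortening factor along the tilt direction is the cosine of the slant angle; the function names are illustrative only.

```python
import math

def scale_factor(octaves):
    """Linear scale factor corresponding to a change of `octaves` octaves."""
    return 2.0 ** octaves

def foreshortening(slant_deg):
    """Foreshortening factor along the tilt direction for a given slant angle."""
    return math.cos(math.radians(slant_deg))
```

For example, `scale_factor(1)` is 2.0 and `scale_factor(-3)` is 0.125, while a slant of 82 degrees gives a foreshortening factor of about 0.139, i.e. slightly more than 1:7.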

Figure 9: Correct recognition rates for slanted objects. The model dataset consists of 26 objects
and all the test objects are slanted versions of the models at different slant angles from 0° to 86°.

Figure 10 illustrates three test images that are noisy versions of the corresponding model
in Fig. 6. Experiments are carried out with additive white noise, low-frequency colored noise
(normalized low-pass cutoff frequency = $\pi/2$), and high-frequency colored noise (normalized
high-pass cutoff frequency = $\pi/2$). For each kind of noise, we experimentally find the largest noise
level at which successful recognition with correct pose estimation is obtained. As seen in Fig. 10, the
test image is successfully recognized in all three cases at very high noise levels, demonstrating that
the scheme is quite robust to additive noise. The Gabor kernels capture the image information
mostly in the low and middle frequencies, and thus the scheme is almost insensitive to
high-frequency noise (Fig. 10(c), SNR = -17 dB) since this noise is outside the frequency range of the
kernels. The scheme is also quite resistant to white noise (up to SNR = -1.8 dB), and less resistant
(up to SNR = 5 dB) to low-frequency noise for the same reason. Thus, we can conclude that the
overall recognition scheme is quite robust to the effects of additive noise and clutter. Fig. 11
gives the correct recognition rates for white-noise corrupted images with different levels of SNR.

Figure 10: (a) Successfully recognized test image with additive white noise (SNR = -1.8 dB). (b)
Successfully recognized test image with additive low-frequency colored noise (SNR = 5.0 dB). (c) The
test image is recognized even though it is hardly seen (additive high-frequency colored noise of
SNR = -17.0 dB).

Figure 11: Correct recognition rates for white-noise corrupted objects. The model dataset consists
of 26 objects and the test objects are noisy and scaled versions of the models at three different scales.
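
To make the quoted noise levels concrete, the following sketch corrupts a list of pixel values with white Gaussian noise at a requested SNR. The SNR definition used here (signal variance over noise variance, in dB) is an assumption, since the report does not state which definition was used, and the function names are hypothetical.

```python
import math
import random

def add_white_noise(pixels, snr_db, seed=0):
    """Return a copy of `pixels` with zero-mean Gaussian noise at the given SNR (dB)."""
    rng = random.Random(seed)
    mean = sum(pixels) / len(pixels)
    signal_power = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [p + rng.gauss(0.0, sigma) for p in pixels]

def snr_db(clean, noisy):
    """Measured SNR in dB between a clean signal and its noisy version."""
    mean = sum(clean) / len(clean)
    signal_power = sum((p - mean) ** 2 for p in clean) / len(clean)
    noise_power = sum((q - p) ** 2 for p, q in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)
```

At an SNR of 0 dB the noise variance equals the signal variance, and at negative values such as -1.8 dB the noise is stronger than the signal.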


Figure 12: (a) The signature of the noiseless truck model for Fig. 10. (b) The signature of the noisy
image in Fig. 10(c). Only the high frequency regions are affected while the low and middle frequency
responses still allow for robust recognition.

In order to explain the surprisingly good recognition in high-frequency noise, we demonstrate
the effects of high-frequency noise on the signature in Fig. 12(a) and Fig. 12(b). As observed, the
noisy signature in Fig. 12(b) (which corresponds to the picture in Fig. 10(c)) is affected at only
two boundaries while the rest is almost identical to the signature in Fig. 12(a). This explains
why the computer is still able to recognize the image in Fig. 10(c), which looks completely noisy
to the human eye.

Figure 13: (a) Original high resolution model (128 x 128). (b) Medium resolution test image (64 x 64).
(c) Low resolution test image (32 x 32).

Figures 13(b) and 13(c) illustrate test images that have reduced resolution with respect to the model
image in Fig. 13(a) (which is 128 x 128 in size). Experiments were performed over all the 26
models for each of these resolutions. To simulate affine transformation in addition to the effects
of reduced resolution, all the test images correspond to model images and are scaled by a factor
of 0.8 and swung by 30 degrees. Over all the 26 models in the dataset, the medium resolution
(64 x 64) set of test images (see Fig. 13(b)) yields 100% successful recognition with the correct
pose estimated in all tests. In the low resolution (32 x 32) experiments of Fig. 13(c), 96% of
the test images were successfully recognized along with accurate pose estimation. These results
show that the representation and recognition scheme is quite robust to significant degradation
that corresponds to lower resolution. Such degradation could occur from large viewing distances.
In  Fig.  14,  the  correct  recognition  rates  under  different  levels  of  resolution  reduction  are  given. 
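
A reduced-resolution test image of the kind described above can be produced with a simple sketch such as the following. The box (block-average) filter is an assumption chosen for brevity; the report only states that the test objects are obtained by low-pass filtering and down-sampling.

```python
def downsample(image, factor):
    """Low-pass filter by block averaging, then keep one sample per block.

    image  : 2-D list of gray levels.
    factor : integer reduction factor (2 turns 128 x 128 into 64 x 64).
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - factor + 1, factor):
        out_row = []
        for c in range(0, cols - factor + 1, factor):
            block = [image[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out
```

With a 128 x 128 model image, `downsample(image, 2)` and `downsample(image, 4)` give the 64 x 64 and 32 x 32 sizes used in the experiments.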

1.4.2  Recognition  of  Multiple  Patches  and  Non-Isolated  Objects  Using  Model  Based 
Segmentation 

In  all  the  experiments  described  in  Section  1.4.1,  we  consider  every  image  as  a  single  patch.  For 
recognition  of  multiple  objects  in  one  image,  we  first  have  to  obtain  the  spectral  signature  for 
every local patch. Most objects encountered in daily life are composed of a number of primitive
standard shapes. Hence, it is assumed that a large set of man-made objects (which include most
objects in our model dataset) can be represented by a small set of standard primitives. In
our case, we prefer simple primitives such as rectangles, semicircles etc. that can approximate
many man-made flat surfaces. For detection of such primitives in the image, it is advantageous
to use their boundaries (edges) since this information is more stable in varying conditions of
illumination.

J. Ben-Arie

Figure 14: Correct recognition rates for objects at different resolutions. The experiments are
performed with 26 object models and the test objects are derived from the models through
low-pass filtering and down-sampling.

For the detection of standard primitives that may vary in their proportions, sizes and orientations,
it is necessary to generate a set of boundary models. These models cover the boundaries
of  standard  primitives  with  strips  to  allow  for  local  variations  (as  in  Fig.  16(a)).  When  an  edge 
map  of  an  image  is  convolved  with  the  set  of  strip  models,  the  peaks  detected  indicate  possible 
existence  of  shapes  similar  to  the  corresponding  model  set.  In  order  to  detect  primitive  parts  in 
the  image  with  affine-invariance,  the  set  of  strip  models  must  include  all  the  affine-transformed 
versions  of  each  primitive  part.  Since  the  number  of  strip  models  in  the  library  might  be  very 
large,  an  efficient  representation  approach  needs  to  be  developed. 

We choose to employ here the Karhunen-Loeve (K-L) transform, which is commonly used as an
optimal compression technique for images. For this application, the K-L transform is employed
to compress strip models. A large number of strip models (approximately 44,000 templates) are
approximated by only 10 eigentemplates. The eigentemplates are then convolved with the edge
map of the image. Fig. 15 shows the convolution scores of the edge map with the K-L based set
of strip templates. At each point, the score denotes the highest score of the convolutions of all
strip models with the edge image. As seen in Fig. 15, sharp peaks are obtained for a slanted
semi-circle (Fig. 15(b)) and a rectangle (Fig. 15(c)). The peaks of the convolution provide a
robust model based segmentation of the image as illustrated in Fig. 16(a). It also provides an
affine invariant initial identification of the primitives in the image and their poses.

Figure 15: (a) Recognition of neighboring objects (airplane and mouse) in background clutter. (b)
Convolution score of the edge map of the image with the semi-circular strip template set. (c)
Convolution score of the edge map of the image with the rectangular strip template set.

Figure 16: (a) Initial estimation of shapes and poses. The rectangular and semicircular strip models
that correspond to peaks found in Fig. 15(b) and Fig. 15(c) are overlaid on the image with their
correct affine transforms. (b) Final results after flexible matching.

At a second stage, for more accurate segmentation, an algorithm for flexible matching of
the primitive parts to the real edges in the image is employed. The principle of elastic matching
implemented is similar to that of snakes [15], but our flexible matching algorithm differs in
that our primitive models are attached to a virtual elastic sheet that is distorted to
match the exact shape beneath it. When the elastic sheet finally locks on to the real image,
the shape of the corresponding part is determined. Fig. 16(b) shows the final result of flexible
matching for the image. The primitive parts are also found the same way in the model dataset.
The signatures of the segmented parts in the image are then matched against the primitive
parts of the models. Since the parts are segmented and isolated, the signatures obtained are
not affected by neighboring objects and the background. For example, we display the objects
that match rectangular primitives in our 26 model dataset of Fig. 6 and their matching scores
with the segmented mouse in Fig. 17. Even though the segmentation of the mouse in Fig. 16(b)
includes small parts of the airplane wings, the matching scores of the signatures clearly classify
it correctly. Based upon the segmentation results, the airplane and the mouse are successfully
recognized.
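
Returning to the first segmentation stage, the idea of replacing a large template library by a few eigentemplates can be sketched as follows. This is a minimal illustration, not the report's implementation: the basis is found by power iteration with deflation, no mean is subtracted, and all function names are hypothetical. It shows that the correlation of any template with an edge patch can be approximated from the correlations of the eigentemplates alone, which is what makes the K-L compression useful here.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def eigentemplates(templates, count, iterations=100):
    """Approximate the leading `count` K-L basis vectors of flattened templates."""
    residual = [list(t) for t in templates]
    size = len(templates[0])
    basis = []
    for _ in range(count):
        vec = [1.0 + 0.1 * i for i in range(size)]       # arbitrary start vector
        norm = math.sqrt(dot(vec, vec))
        vec = [x / norm for x in vec]
        for _ in range(iterations):
            # Multiply by the scatter matrix sum_t t t^T without forming it.
            new = [0.0] * size
            for t in residual:
                weight = dot(t, vec)
                new = [n + weight * x for n, x in zip(new, t)]
            norm = math.sqrt(dot(new, new))
            if norm == 0.0:
                break
            vec = [x / norm for x in new]
        basis.append(vec)
        # Deflate: remove the component along the new basis vector.
        deflated = []
        for t in residual:
            coeff = dot(t, vec)
            deflated.append([x - coeff * v for x, v in zip(t, vec)])
        residual = deflated
    return basis

def project(template, basis):
    """K-L coefficients of a template in the eigentemplate basis."""
    return [dot(template, b) for b in basis]

def approx_score(edge_patch, coeffs, basis):
    """Correlation of a template with an edge patch, using eigentemplates only."""
    return sum(c * dot(edge_patch, b) for c, b in zip(coeffs, basis))
```

In use, the correlations `dot(edge_patch, b)` are computed once per image location for the few eigentemplates, and the score of every stored template follows from its precomputed coefficients.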

This approach is run on a Pentium Pro (200 MHz) personal computer. For each image patch,
28224 eleven-dimensional indices are generated. In the experiments of isolated object
recognition, the average recognition time is around 20 seconds. It takes around 3.5 minutes to recognize
objects  in  cluttered  images.  A  detailed  and  general  analysis  of  time  and  memory  requirements 
for  multidimensional  indexing  can  be  found  in  [13]. 


Figure  17:  The  models  matching  the  rectangular  primitive.  The  matching  scores  of  the  mouse  in 
Fig.  16(b)  to  these  models  are:  0.53,  0.59,  0.50,  0.45,  0.50  and  0.80  for  a)  to  f). 


1.5  Conclusion 

We  present  here  an  approach  for  affine-invariant  object  recognition  by  pictorial  recognition  of 
image  patches  that  correspond  to  object  surfaces  that  are  roughly  planar.  Each  surface  can 
be  recognized  separately  invariant  to  its  3D  pose,  employing  novel  Affine-Invariant  Spectral 
Signatures (AISSs). The 3D-pose invariant recognition is achieved by correlating the image
with a novel configuration of Gabor kernels and extracting local spectral signatures. The local
spectral  signature  of  each  image  patch  is  then  matched  against  a  set  of  pictorial  models  using 
Multi-Dimensional  Indexing  (MDI)  in  the  frequency  domain.  Affine-invariance  of  the  signatures 
is  achieved  by  a  new  log-log  sampling  configuration  in  the  frequency  domain  which  can  be 
represented  by  short-term  Fourier  transform  or  by  Gabor  transform  in  two  orthogonal  axes.  In 
our  experiments,  we  find  that  spectral  signatures  have  a  significant  discriminative  power  even 
without  phase  information.  100%  correct  affine-invariant  recognition  is  obtained  in  a  range  of 
more  than  3  octaves  of  scaling  and  slant  angles  of  more  than  80  degrees,  with  image  swing  and 
shape  tilt  of  360  degrees  with  a  dataset  of  26  gray-level  models.  Experiments  also  reveal  that 
the  method  works  with  severe  additive  white  and  colored  noise  (SNR  of  -17  dB  to  5  dB)  and 
degraded  resolution.  To  overcome  the  problem  of  recognition  of  non-isolated  objects,  we  develop 
a model based segmentation scheme. This scheme enables the extraction of isolated signatures of image
regions, which are affine projections of a set of basic geometric shapes such as rectangle, triangle,
semicircle etc.

2 A Volumetric/Iconic Frequency Domain Representation
for Objects with Application to Pose Invariant
Face Recognition

2.1  Introduction 

A major problem in 3-D object recognition is the method of representation, which determines,
to a large extent, the recognition methodology and approach. The large variety of
representation methods presented in the literature do not provide a direct link between the
3-D object representation and its 2-D views. These representation methods can be divided into
two major categories: object centered and viewer centered (iconic). Detailed discussions are
included in [25] and [21]. An object centered representation describes objects in a coordinate
system attached to the objects. Examples of object centered methods of representation are spatial
occupancy by voxels [25, pp. 468-469], constructive solid geometry (CSG) [25, p. 468],
superquadrics [32] [9], etc. However, object views are not explicitly stored in such representations,
and therefore such datasets do not facilitate the recognition process, since the images cannot
be directly indexed into such a dataset and need to be matched to views generated by
perspective/orthographic projections. Since the viewpoint of the given image is a priori unknown,
the recognition process becomes computationally expensive. The second category, i.e. viewer
centered (iconic) representation, is more suitable for matching a given image with such a dataset,
since the dataset also is comprised of various views of the objects. Examples of viewer
centered methods of representation are aspect graphs [26], quadtrees [21], Fourier descriptors [45],
moments [23], etc. However, in a direct viewer centered approach, the huge number of views
that need to be stored renders this approach impractical for large object datasets. Moreover, such
an approach does not automatically provide a 3-D description of the object. For example, in
representations by aspect graphs [26], qualitative 2-D model views are stored in a compressed
graph form, but the view retrieval requires additional 3-D information in order to generate the
actual images from different viewpoints. In principle, viewer centered aspect graph approaches
do not offer a significant advantage over object centered approaches. In summation, viewer
centered and object centered representations have complementary merits that could be augmented
in a merged representation, as proposed in this section.

A  first  step  in  unifying  object  and  viewer  centered  approaches  is  provided  by  our  recently 
developed  Affine  Invariant  Spectral  Signatures  (AISS)  approach  [7]  [6]  [5],  which  is  based  on  an 
iconic  2-D  representation  in  the  frequency  domain.  However,  the  AISS  is  fundamentally  different 
from other viewer centered representations since each 2-D shape representation encapsulates all
the appearances of that shape from any spatial pose. It also means that the AISS enables
recognition of surfaces which are approximately planar, invariant to their pose in space. Although
this  approach  is  basically  viewer  centered,  it  has  the  advantage  of  directly  linking  3-D  model 
information  with  image  information,  thus  merging  object  and  viewer  centered  approaches.  Hence, 
to  generalize  the  AISS  it  is  necessary  to  extend  it  from  2-D  or  flat  shapes  to  general  3-D  shapes. 
Towards  this  end,  we  describe  in  Section  2.2,  a  novel  representation  of  3-D  objects  by  their  3-D 
spectral  signatures  which  also  captures  all  the  2-D  views  of  the  object  and  therefore  facilitates 
direct  indexing  of  a  given  image  into  such  a  dataset. 

As  a  demonstration  of  the  V/ISS  representation,  it  is  applied  for  estimating  pose  of  faces 
and  face  recognition  in  Section  2.3.  Range  image  data  of  a  human  head  is  used  to  construct  the 
V/ISS model of a simulated “generic” face. We demonstrate that reconstructions from slices of
the V/ISS are accurate enough to recognize faces from different spatial poses and scales.
In  Section  2.3,  we  describe  the  matching  technique  by  means  of  which  a  gray  scale  image  of  a 
face  is  directly  indexed  into  the  3-D  V/ISS  model  based  on  fast  matching  by  correlation  in  a  4 
dimensional  Fourier  space.  In  our  experiments  (described  in  Section  2.5),  we  demonstrate  how 
the  range  data  generated  from  a  model  is  used  to  estimate  the  pose  of  a  person’s  face  in  various 
images.  We  also  demonstrate  the  robustness  of  our  2-D  slice  matching  process  by  recognizing 
faces  with  different  poses  from  a  dataset  of  40  subjects,  and  present  statistics  of  the  matching 
experiments. 

2.2  Volumetric/Iconic  Spectral  Signature 

In  this  section,  we  describe  a  novel  formulation  that  merges  the  3-D  object  centered  representation 
in  the  frequency  domain  to  a  continuum  of  its  views.  The  views  are  also  expressed  in  the  frequency 
domain.  The  following  formulation  describes  the  basic  idea. 

Given an object O, which is defined by its spatial occupancy on a discrete 3-D grid as a set
of voxels $\{V(x,y,z)\}$, we assume without loss of generality that the object is of equal density.
Thus, $V(x,y,z) = 1\ \forall\, (x,y,z) \in O$ and $V(x,y,z) = 0$ otherwise. The 3-D Discrete Fourier
Transform (DFT) of the object is given by

$$\mathcal{V}(u,v,w) = \mathcal{F}\{V(x,y,z)\} = \sum_{x=0}^{N-1}\sum_{y=0}^{N-1}\sum_{z=0}^{N-1} V(x,y,z)\, e^{-j\frac{2\pi}{N}(ux+vy+wz)} , \qquad (11)$$

where $j = \sqrt{-1}$. The surface of the object is derived from the gradient vector field

$$\nabla V(x,y,z) = \left[\mathbf{k}_x\frac{\partial}{\partial x} + \mathbf{k}_y\frac{\partial}{\partial y} + \mathbf{k}_z\frac{\partial}{\partial z}\right] V(x,y,z) , \qquad (12)$$

where $\mathbf{k}_x$, $\mathbf{k}_y$ and $\mathbf{k}_z$ are the unit vectors along the x, y and z axes. The 3-D Discrete Fourier
Transform (DFT) of the surface gradient is given by the frequency domain vector field:

$$\mathcal{F}\{\nabla V(x,y,z)\} = j\frac{2\pi}{N}\left(\mathbf{k}_x u + \mathbf{k}_y v + \mathbf{k}_z w\right)\mathcal{V}(u,v,w) . \qquad (13)$$

Let the object be illuminated by a distant light source¹ with uniform intensity T and direction
$\mathbf{i} = i_x\mathbf{k}_x + i_y\mathbf{k}_y + i_z\mathbf{k}_z$. We assume the surface has an albedo $A(x,y,z)$. For a voxel based
description, the gradient magnitude $|\nabla V| \approx K$ (constant). $\nabla V$ may be estimated as $V * \nabla G$.
Thus the surface normal is given by $\frac{1}{K}\nabla V$. We assume that O has a Lambertian surface with
constant albedo. Thus points on its surface have a brightness proportional to

$$B_i(x,y,z) = B_i^+(x,y,z) + B_i^-(x,y,z) = \frac{TA}{K}\left[i_x\frac{\partial}{\partial x} + i_y\frac{\partial}{\partial y} + i_z\frac{\partial}{\partial z}\right] V(x,y,z) , \qquad (14)$$

where $B_i^+$ and $B_i^-$ are the positive and negative parts. The function $B_i^-(x,y,z)$ is not a physically
realizable brightness and is introduced only for completeness of Eq. (14). The separation of
the brightness function into positive and negative components is used to consider only positive
illuminations. The negative components are disregarded in further processing, as this function
is separable only in the spatial domain. As elaborated in Section 2.2.1, $B_i^-$ can be eliminated
using a local Gabor transform.
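
The derivative property behind Eq. (13) and the positive/negative split of Eq. (14) can be illustrated in one dimension with the sketch below. It is a toy example under stated assumptions: a naive O(N^2) DFT is used for clarity, the frequency index is interpreted as a signed frequency (so that the result is real), and the function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive forward DFT with the e^{-j 2 pi u k / N} kernel."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * u * k / n) for k in range(n))
            for u in range(n)]

def idft(spec):
    """Naive inverse DFT."""
    n = len(spec)
    return [sum(spec[u] * cmath.exp(2j * math.pi * u * k / n) for u in range(n)) / n
            for k in range(n)]

def spectral_derivative(x):
    """Differentiate a periodic sequence by multiplying its DFT by j*(2*pi/N)*u."""
    n = len(x)
    spec = dft(x)
    signed = [u if u < n / 2 else u - n for u in range(n)]   # signed frequency index
    if n % 2 == 0:
        signed[n // 2] = 0                                   # drop the ambiguous Nyquist term
    scaled = [2j * math.pi / n * u * s for u, s in zip(signed, spec)]
    return [value.real for value in idft(scaled)]

def split_brightness(values):
    """Split a signed brightness profile into its positive and negative parts."""
    positive = [max(v, 0.0) for v in values]
    negative = [min(v, 0.0) for v in values]
    return positive, negative
```

Applied to one period of a sine, `spectral_derivative` returns the corresponding cosine scaled by 2*pi/N, and `split_brightness` separates the illuminated (positive) samples from the physically unrealizable negative ones.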

It is also necessary to consider the viewing direction when generating views from the V/ISS.
The brightness function $B_i(x,y,z)$ is decomposed as a 3-D vector field by projecting onto the
surface normal at each point of the surface. This enables the correct projection of the surface
from a given viewpoint. As noted earlier, the surface normal is given by $\frac{1}{K}\nabla V$. Thus the new
vectorial brightness function is given by

$$\mathbf{B}_i(x,y,z) = \frac{TA}{K}\left[\mathbf{i}\cdot\nabla V(x,y,z)\right]\frac{1}{K}\nabla V(x,y,z) . \qquad (15)$$

The 3-D Fourier transform of this model is a complex 3-D vector field $\mathcal{V}_i(u,v,w) = \mathcal{F}\{\mathbf{B}_i(x,y,z)\}$.
The transform is evaluated as:

$$\mathcal{V}_i(u,v,w) = \frac{TA}{K}\, j\frac{2\pi}{N}\left(i_x u + i_y v + i_z w\right)\mathcal{V}(u,v,w) * \frac{1}{K}\, j\frac{2\pi}{N}\left(\mathbf{k}_x u + \mathbf{k}_y v + \mathbf{k}_z w\right)\mathcal{V}(u,v,w) , \qquad (16)$$

where * denotes convolution. Variation in illumination only emphasizes the amplitude of $\mathcal{V}_i$ in
the $(i_x, i_y, i_z)$ direction, but does not change its basic structure. The absolute value of $\mathcal{V}_i(u,v,w)$
is defined as the Volumetric/Iconic Spectral Signature (V/ISS).


¹ Additional light sources can be easily handled using superposition.

Figure 18: The Projection-Slice Theorem: A slice of the 3-D Fourier Transform of a rectangular block
(on the right) is equivalent to the 2-D Fourier Transform of the projection of the image of that block
(on the left).

2.2.1  Projection  Slice  Theorem  and  2-D  Views 

The function $\mathcal{V}_i(u,v,w)$ is easily obtained, given the object O. To generate the view of the
object, we resort to 3-D extensions of the Projection-Slice Theorem [24] [36] that projects the
3-D vector field $\mathcal{V}_i(u,v,w)$ onto the central slice plane normal to the viewpoint direction. Fig. 18
illustrates the principle by showing the slice derived from the 3-D DFT of a rectangular block.
Orthographically viewing the object from a direction $\mathbf{c} = c_x\mathbf{k}_x + c_y\mathbf{k}_y + c_z\mathbf{k}_z$ results in an image
$I_c(x',y')$, which has a 2-D DFT given by $\mathcal{I}_c(u',v')$. To find $I_c(x',y')$ and its DFT $\mathcal{I}_c(u',v')$,
it is necessary to project the vector brightness function $\mathbf{B}_i(x,y,z)$ along the viewing direction
$\mathbf{c}$ after removing all the occluded parts from that viewpoint. The vectorial decomposition of
the brightness function along the surface normals as given by Eq. (15) compensates for the
integration effects of projections of slanted surfaces. This explains the necessity of using a
vectorial frequency domain representation.

Removing  the  occluded  surfaces  is  not  a  simple  task  if  the  object  O  is  not  convex  or  if  the 
scene  includes  other  objects  that  may  partially  occlude  O.  For  now,  we  shall  assume  that  O  is 
convex  and  is  entirely  visible.  This  assumption  is  quite  valid  for  local  image  analysis  where  a 
local patch can always be regarded as either entirely occluded or visible. Also, for local analysis,
$B_i^-(x,y,z)$ is not a major problem. The visible part of $\mathbf{B}_i(x,y,z)$ from direction $\mathbf{c}$, denoted by
$B_{ic}(x,y,z)$, is given by

$$B_{ic}(x,y,z) = \frac{TA}{K}\,\mathrm{hwr}\left[\mathbf{i}\cdot\nabla V(x,y,z)\right]\frac{1}{K}\,\mathrm{hwr}\left[\mathbf{c}\cdot\nabla V(x,y,z)\right] , \qquad (17)$$

where hwr[a] is the “half wave rectified” value of a, i.e. hwr[a] = a if a > 0 and hwr[a] = 0 if
a ≤ 0.

Now $\mathcal{V}_{ic}(u,v,w)$ can be obtained from $B_{ic}(x,y,z)$ simply by calculating the DFT,

$$\mathcal{V}_{ic}(u,v,w) = \mathcal{F}\{B_{ic}(x,y,z)\} . \qquad (18)$$

The image DFT $\mathcal{I}_c(u',v')$ is obtained using the Projection-Slice Theorem [24] [36] by slicing
$\mathcal{V}_{ic}(u,v,w)$ through the origin $u = v = w = 0$ with a plane normal to $\mathbf{c}$, i.e. $uc_x + vc_y + wc_z = 0$.
$\mathcal{I}_c(u',v')$ is derived by sampling $\mathcal{V}_{ic}(u,v,w)$ on this plane. An example of such a slicing operation
is illustrated in Fig. 18. Note that $\mathcal{V}_i$ actually encapsulates both the object's 3-D representation
and the continuum of its view-signatures, which are stored as planar sections of $|\mathcal{V}_i|$. As we
see from Eq. (16), variations in illumination only emphasize the amplitude of $\mathcal{V}_i$ in the $(i_x, i_y, i_z)$
direction, but do not change its basic structure. Thus, it is feasible to recognize objects that
are illuminated from various directions by local signature matching methods as described in
Section 2.2.3, while employing the same signature.
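
The Projection-Slice Theorem itself is easy to check numerically in its simplest, axis-aligned form. The sketch below is an illustration only: it projects a small volume along the z axis without any occlusion or brightness model, uses a naive DFT for clarity, and the function names are hypothetical. It confirms that the w = 0 slice of the 3-D DFT equals the 2-D DFT of the projection.

```python
import cmath
import math

def dft3_at(vol, u, v, w):
    """One coefficient of the 3-D DFT of a cubic volume vol[x][y][z]."""
    n = len(vol)
    return sum(vol[x][y][z] * cmath.exp(-2j * math.pi * (u * x + v * y + w * z) / n)
               for x in range(n) for y in range(n) for z in range(n))

def dft2_at(img, u, v):
    """One coefficient of the 2-D DFT of a square image img[x][y]."""
    n = len(img)
    return sum(img[x][y] * cmath.exp(-2j * math.pi * (u * x + v * y) / n)
               for x in range(n) for y in range(n))

def project_along_z(vol):
    """Orthographic projection: sum the volume along the z axis."""
    return [[sum(column) for column in plane] for plane in vol]
```

For an arbitrary viewing direction the same identity holds on the tilted central plane $uc_x + vc_y + wc_z = 0$, which in a discrete implementation requires interpolating the 3-D DFT on that plane.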


2.2.2  Local  Signature  Analysis  in  3-D 


Local signature analysis is implemented by windowing $\mathbf{B}_i$ with a 3-D Gaussian centered at
location $(\mu_x, \mu_y, \mu_z)$ and proceeding as in Eq. (15) on the windowed object gradient. Such local
frequency analysis removes the self-occluded parts. Therefore, we use in our frequency analysis
and representation the Gabor Transform (GT) instead of the DFT. The transition required from
the DFT to the GT is quite straightforward. The object O is windowed with a 3-D Gaussian to
give


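$$B_{iG} = G[B_i] = B_i\, e^{-\frac{(x-\mu_x)^2 + (y-\mu_y)^2 + (z-\mu_z)^2}{2\sigma^2}} . \qquad (19)$$

The equivalent local V/ISS is given by

$$\mathcal{V}_{iG}(u,v,w) = \mathcal{V}_i(u,v,w) * \mathcal{F}\left\{ e^{-\frac{(x-\mu_x)^2 + (y-\mu_y)^2 + (z-\mu_z)^2}{2\sigma^2}} \right\} . \qquad (20)$$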


The important outcomes of this are: 1) The Radon transform and the Projection-Slice Theorem
[24], [36] can still be employed for local space-frequency signatures of object parts. 2) In local
space-frequency analysis, $B_i$ almost never contains a problematic $B_i^-$ part, which can
be eliminated by the windowing function. We note that for most local surfaces, $[\mathbf{B}_i \cdot \mathbf{c}] \approx B_{ic}$, as
the local analysis approximates the hwr[ ] function with respect to viewing direction $\mathbf{c}$. Hence,
the V/ISS of $B_{ic}$ is a general representation of a local surface patch of $V(x,y,z)$.




2.2.3  Indexing  using  V/ISS 


As explained in Section 2.2.1, the V/ISS is a continuum of the 2-D DFTs of views of the model.
To facilitate indexing into the V/ISS data structure, we consider the V/ISS slice plane
$u c_x + v c_y + w c_z = 0$, where $(c_x, c_y, c_z)$ are the direction cosines of the slice plane
normal. We define a 4-D pose space in the frequency domain which consists of the azimuth $\alpha$
and elevation $\epsilon$, defining the slice plane normal with respect to the original axes, the
in-plane rotation $\theta$ of the slice plane, and the scale $\rho$, which changes with the
distance to the viewed object. Fig. 20 illustrates the coordinate system used. $(c_x, c_y, c_z)$
are related to the azimuth $\alpha$ and elevation $\epsilon$ as follows

\[
\begin{bmatrix} c_x \\ c_y \\ c_z \end{bmatrix} =
\begin{bmatrix} \cos\alpha \cos\epsilon \\ \sin\alpha \cos\epsilon \\ \sin\epsilon \end{bmatrix},
\qquad -\pi/2 < \alpha < \pi/2, \quad -\pi/2 < \epsilon < \pi/2 \qquad (21)
\]
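As a small check of Eq. (21), the direction cosines can be computed directly from the azimuth and elevation. The function name `slice_normal` is hypothetical and the angles below are arbitrary examples.

```python
import numpy as np

def slice_normal(alpha, eps):
    """Direction cosines (c_x, c_y, c_z) of the slice-plane normal, as in Eq. (21)."""
    return np.array([np.cos(alpha) * np.cos(eps),
                     np.sin(alpha) * np.cos(eps),
                     np.sin(eps)])

c = slice_normal(np.deg2rad(30.0), np.deg2rad(20.0))   # arbitrary example pose
frontal = slice_normal(0.0, 0.0)                       # zero azimuth and elevation
```

The result always has unit length, and zero azimuth and elevation give the normal (1, 0, 0), i.e. a view along the x-axis.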


We note again that slices of the V/ISS are planes which are parallel to the imaging plane.
Thus the image plane normal and the slice plane normal coincide. By using 3-D coordinate
transformations (see Fig. 20) we can transform the frequency domain V/ISS model to the 4-D
pose space $(\alpha, \epsilon, \theta, \rho)$. Let $(u, v, w)$ represent the original V/ISS
coordinate system and $(u', v', w')$ be the coordinate system defined by the slice plane. The
slice plane lies in the 2-D coordinate system $(u', v')$, where $w'$ is the normal to the slice
plane (and also the viewing direction). The relation between these two systems is given by

\[
\begin{bmatrix} u \\ v \\ w \end{bmatrix} =
\begin{bmatrix}
\cos\alpha \sin\epsilon & -\sin\alpha & \cos\alpha \cos\epsilon \\
\sin\alpha \sin\epsilon & \cos\alpha & \sin\alpha \cos\epsilon \\
-\cos\epsilon & 0 & \sin\epsilon
\end{bmatrix}
\begin{bmatrix} u' \\ v' \\ w' \end{bmatrix}
\qquad (22)
\]

V/ISS slices, being 2-D DFTs of model views, are further transformed to polar coordinates by
considering the in-plane rotation $\theta$ (equivalent to the image swing or rotation about the
optical axis) and the radial frequency $r_f$:

\[
\begin{bmatrix} u' \\ v' \end{bmatrix} = r_f \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix},
\qquad -\pi/2 < \theta < \pi/2, \qquad r_f = \sqrt{u'^2 + v'^2}, \qquad r_0 \le r_f \le r_{max} \qquad (23)
\]
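The change of basis in Eq. (22) can be checked numerically. The sketch below assumes that the first entry of the third row is $-\cos\epsilon$, which is what makes the matrix orthonormal with the slice normal of Eq. (21) as its third column. The function name `slice_basis` is hypothetical.

```python
import numpy as np

def slice_basis(alpha, eps):
    """Columns are the slice axes u', v' and the normal w', expressed in (u, v, w)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    ce, se = np.cos(eps), np.sin(eps)
    return np.array([[ca * se, -sa, ca * ce],
                     [sa * se,  ca, sa * ce],
                     [-ce,     0.0, se]])

alpha, eps = np.deg2rad(25.0), np.deg2rad(-10.0)        # arbitrary example pose
R = slice_basis(alpha, eps)
normal = np.array([np.cos(alpha) * np.cos(eps),
                   np.sin(alpha) * np.cos(eps),
                   np.sin(eps)])                        # slice normal from Eq. (21)
```

With this form, `R` is a proper rotation (orthonormal, determinant +1) and its third column is the viewing direction, so the first two columns span the slice plane.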


The radial frequency $r_f$ is transformed logarithmically to attain exponential variation of
$r_f$. Thus

\[
\rho = \log_a \frac{r_f}{r_0} \qquad (24)
\]

The full transformation of the coordinate system to the 4-D pose space, obtained by substituting
Eqs. (23) and (24) into Eq. (22) with $w' = 0$, is given by

\[
\begin{bmatrix} u \\ v \\ w \end{bmatrix} = r_0 a^{\rho}
\begin{bmatrix}
\cos\theta \cos\alpha \sin\epsilon - \sin\theta \sin\alpha \\
\cos\theta \sin\alpha \sin\epsilon + \sin\theta \cos\alpha \\
-\cos\theta \cos\epsilon
\end{bmatrix}
\qquad (25)
\]

Thus the 4-tuple $(\alpha, \epsilon, \theta, \rho)$ defines all the points in the 3-D V/ISS
frequency space $(u, v, w)$. We observe that the space defined by the 4-tuple
$(\alpha, \epsilon, \theta, \rho)$ is redundant in the sense that an infinite number of 4-tuples
may represent the same $(u, v, w)$ point. However, this representation has the important
advantage that every $(\alpha, \epsilon)$ pair defines a planar slice in $(u, v, w)$ space.
Moreover, every $\theta$ defines an image swing and every $\rho$ defines another scale. Thus the
$(\alpha, \epsilon, \theta, \rho)$ representation significantly simplifies the indexing search
for the viewing poses and scales. The indexing can now be implemented simply by correlation in
the frequency domain, which immediately determines all pose parameters as linear shifts in
$(\alpha, \epsilon, \theta, \rho)$ space. The significance of this transformation to the 4-D pose
space lies in the following properties. The polar coordinate transformation within the slice
allows rotated image views to have 2-D frequency domain signatures which shift along the
$\theta$ axis. Similarly, the exponential sampling of the radial frequency $r_f$ results in scale
changes causing linear shifts along the $\rho$ axis. Thus the new coordinate system given by
$(\alpha, \epsilon, \theta, \rho)$ yields a 2-D frequency domain signature whose shape is
invariant to viewpoint and scale, which appear only as linear shifts in the 4-D pose space so
defined. A particular slice corresponding to a particular viewpoint is easily indexed into the
transformed V/ISS by using correlation.
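The mapping from the 4-D pose space to the 3-D frequency space can be illustrated with a short sketch. It assumes the orthonormal form of the slice basis (third row beginning with $-\cos\epsilon$), a base radius `r0 = 1`, and a logarithm base `a = 2`. The base is chosen here only because scale values such as 1.414 and 1.6818 are powers of 2; the text does not state `a` explicitly. The function name is hypothetical.

```python
import numpy as np

def pose_to_frequency(alpha, eps, theta, rho, r0=1.0, a=2.0):
    """Map a pose-space sample (alpha, eps, theta, rho) to a V/ISS point (u, v, w)."""
    r = r0 * a ** rho                                   # radial frequency on the exponential grid
    ca, sa = np.cos(alpha), np.sin(alpha)
    ce, se = np.cos(eps), np.sin(eps)
    ct, st = np.cos(theta), np.sin(theta)
    return r * np.array([ct * ca * se - st * sa,
                         ct * sa * se + st * ca,
                         -ct * ce])

alpha, eps, theta = np.deg2rad(15.0), np.deg2rad(20.0), np.deg2rad(8.0)
normal = np.array([np.cos(alpha) * np.cos(eps),
                   np.sin(alpha) * np.cos(eps),
                   np.sin(eps)])                        # slice normal for this (alpha, eps)
p = pose_to_frequency(alpha, eps, theta, rho=0.75)
q = pose_to_frequency(alpha, eps, theta, rho=1.75)      # one unit further along the rho axis
```

The point `p` lies in the slice plane (its dot product with `normal` vanishes), and a unit step along the $\rho$ axis multiplies the radial frequency by `a`, which is the linear-shift property used for scale indexing.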

2.3  Pose  Estimation  and  Recognition  of  Human  Faces 

Recognition of human faces is a hard problem for machine vision, primarily due to the complexity
of the shape of a human face. The change in the observed view caused by variation in facial pose
is a continuum, which requires a large number of stored models for every face. Since the
representation of such a continuum of 3-D views is well addressed by our V/ISS representation,
we present here the application of our V/ISS model to pose-invariant recognition of human faces.
First we discuss some of the existing work in face recognition in Section 2.3.1, followed by our
approach to the problem in Section 2.3.2. We present our results in face pose estimation
(Section 2.4) and face recognition (Section 2.5) and compare our face recognition results to
some other recent works using the same database [31].

2.3.1  Face  Recognition:  A  Literature  Survey 

Recent works in face recognition have used a variety of representations, including parameterized
models like deformable templates of individual facial features [44] [38] [16], 2-D pictorial or
iconic models using multiple views [12] [10], matching in eigenspaces of faces or facial features
[33], and intensity based low level interest operators in pictures. Recent significant works
in face recognition have used convolutional neural networks [29] as well as other neural network
approaches like [18] and [42]. Hidden Markov Models [37], modeling faces as deformable intensity
surfaces [30], and elastic graph matching [27] have also been developed for face recognition.

Figure 19: Reconstructions of a model face from slices of the V/ISS are shown for various azimuths
and elevations. Note that all facial features are accurately reconstructed, indicating the
robustness of the V/ISS model.

Parameterized model approaches like that of Yuille et al. [44] use deformable template
models which are fit to preprocessed images by minimizing an energy functional, while
Terzopoulos and Waters [38] used active contour models of facial features. Craw et al. [16] and
others have built global head models from various smaller features. Usually deformable models
are constructed from parameterized curves that outline subfeatures such as the iris or a lip. An
energy functional is defined that attracts portions of the models to pre-processed versions of the
image, and model fitting is performed by minimizing the functional. These models are used to
track faces or facial features in image sequences. A variation is the deformable intensity surface
model proposed by Nastar and Pentland [30]. The intensity is defined as a deformable thin plate
with a strain energy, which is allowed to deform and match varying poses for face recognition. A
97% recognition rate is reported for a database with 200 test images.

Template based models have been used by Brunelli and Poggio [12]. Usually they operate by
direct correlation of image segments and are effective only under invariant conditions of scale,
orientation and illumination. Brunelli and Poggio computed a set of geometrical features such
as nose width and length, mouth position and chin shape. They report a 90% recognition rate on
a database of 47 people. Similar geometrical considerations like symmetry [35] have also been
used. A more recent approach by Beymer [10] uses multiple views and a face feature finder for
recognition under varying pose. An affine transformation and image warping are used to remove
distortion and bring test images and model views into correspondence. Beymer reports a
recognition rate of 98% on a database of 62 people, while using 15 modeling views for each face.

Among the better known approaches is the eigenfaces approach [33]. The principal
components of the database of normalized face images are used for recognition. The results report
a 95% recognition rate for 200 faces from a database of 3000. However, variation in face pose is
limited. More recent reports on a fully automated approach with extensive preprocessing on the
FERET database indicate only 1 mistake on a database of 150 frontal views.

Elastic graph matching using the dynamic link architecture [27] was used quite successfully
for distortion invariant recognition. Objects are represented as sparse graphs whose vertices are
labeled with multi-resolution spectral descriptions and whose edges are associated with
geometrical distances; these graphs form the database. A recognition rate of 97.3% is reported
for a database of 300 people.

Neural network approaches have also been popular. Principal components generated using an
autoassociative network have been used [18] and classified using a multilayered perceptron. The
database consists of 20 people with no variation in face pose or illumination. Weng and Huang
used a hierarchical neural network [42] on a database of 10 subjects. A more recent hybrid
approach uses a self organizing map for dimensionality reduction and a convolutional
neural network for hierarchical extraction of successively larger features for classification [29].
The reported results show a 3.8% error rate on the ORL database using 5 training images per
person.

In [37], an HMM-based approach is used on the ORL database. Error rates of 13% were
reported using a top-down HMM. An extension using a pseudo two-dimensional HMM reduces
the error to 5% on the ORL database. 5 training and 5 test images were used for each of 40
people under various pose and illumination conditions.

2.3.2  V/ISS  model  of  faces


Figure 20: The frequency domain coordinate system in which the slice plane is defined.
$(c_x, c_y, c_z)$ are the direction cosines of the slice plane normal, which has an azimuth
$\alpha$ and an elevation $\epsilon$. Image swing is equivalent to in-plane rotation $\theta$,
and viewing distance results in variation in the radial frequency $r_f$ of the V/ISS function.

In our V/ISS model, we present a novel representation using dense 3-D data to represent a
continuum of views of the face. As indicated by Eq. (18) in Section 2.2, the V/ISS model
encapsulates the information in the 3-D Fourier domain. This has the advantage of 3-D translation
invariance with respect to location in the image, coupled with faster indexing to a view/pose of
the face using frequency domain scale and rotation invariant techniques. Hence, complete 3-D
pose invariant recognition can be implemented on the V/ISS.

Range data of the head is acquired using a Cyberware range scanner. The data consists of
256 x 512 range values measured from the central axis of the scanned volume: 360° of longitude
is sampled in 512 columns, and heights in the range of 25 to 35 cm are sampled in 256 rows. The
data is of the heads of subjects looking straight ahead at 0° azimuth and 0° latitude,
corresponding to the x-axis. This model is then illuminated with numerous sources of uniform
illumination, thus approximating diffuse illumination in a well-lit room. The resulting intensity
data is converted from the cylindrical coordinates of the scanner to Cartesian coordinates and
inserted in a 3-D surface representation of the head surface as given by Eq. (14).
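The conversion from the scanner's cylindrical coordinates to Cartesian coordinates can be sketched as follows. This is an illustration under stated assumptions rather than the processing actually used: column 0 is taken to be 0° longitude (the x-axis), rows are assumed to be linearly spaced in height, and the range map is a synthetic cylinder. The function name and the `height_span` parameter are hypothetical.

```python
import numpy as np

def cylindrical_to_cartesian(radius, height_span=1.0):
    """Convert a (rows, cols) cylindrical range map to (x, y, z) points.

    Columns sample 360 degrees of longitude; rows sample height along the scan axis.
    """
    rows, cols = radius.shape
    phi = 2.0 * np.pi * np.arange(cols) / cols          # longitude of each column
    z = np.linspace(0.0, height_span, rows)             # height of each row (assumed linear)
    x = radius * np.cos(phi)[None, :]
    y = radius * np.sin(phi)[None, :]
    zz = np.broadcast_to(z[:, None], radius.shape)
    return np.stack([x, y, zz], axis=-1)

radius = np.full((256, 512), 10.0)     # hypothetical scan: a cylinder of radius 10
points = cylindrical_to_cartesian(radius)
```

Each entry `points[row, col]` is the 3-D location of one range sample; intensity values could then be attached to these locations when building the surface representation.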

The facial region of interest to us is primarily the frontal region consisting of the eyes, lips
and nose. A region corresponding to this area is extracted by windowing the volumetric surface
model with a 3-D ellipsoid with a Gaussian fall-off centered at the nose. The parameters of the
3-D volumetric mask are adjusted to ensure that the eyes, nose and lips are contained within it,
with the fall-off beyond the facial region. The model thus formed is a complex surface which
consists of the visible parts of the face from a continuous range of views centered around the
x-axis, or the (0°, 0°) direction. The resulting model then corresponds to Eq. (17) in our V/ISS
model. Applying Eq. (18), the V/ISS of the face is obtained. The V/ISS model is then resampled
into the 4-D pose space using Eq. (25) as described in Section 2.2.3. Reconstructions of a range
of viewpoints of a model head from the V/ISS slices are shown in Fig. 19. We see from the
reconstructions that all relevant facial characteristics are retained, thus justifying our use of
the vectorial V/ISS model. This model is used in the face pose estimation experiments.

2.3.3  Indexing  images  into  the  V/ISS 

Images of human faces are masked with an ellipse with Gaussian fall-off to eliminate background
textures. The resulting image shows the face with the eyes, nose and lips. The magnitudes
of the Fourier transform of the windowed 2-D face images are calculated. The windowing has the
effect of focusing on local frequency components (or foveating) on the face, while retaining the
frequency components due to facial features. The Fourier magnitude spectrum makes the spectral
signature translation invariant in the 2-D imaging plane. The spectrum is then sampled in a
log-polar scheme similar to the slices of the V/ISS. As most illumination effects are typically
of lower frequency, band pass filtering is used to compensate for illumination.
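The signature computation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it uses nearest-neighbor sampling, a random array in place of a windowed face image, and simply starts the radial grid at `r_min` to discard the lowest frequencies, whereas the band-pass filter actually used is not specified in the text. The function name and all parameter values are hypothetical.

```python
import numpy as np

def log_polar_signature(image, n_theta=32, n_rho=16, r_min=2.0):
    """Sample the centered Fourier magnitude of `image` on a log-polar grid."""
    mag = np.abs(np.fft.fftshift(np.fft.fft2(image)))
    cy, cx = mag.shape[0] // 2, mag.shape[1] // 2       # location of the zero frequency
    r_max = min(cy, cx) - 1
    # Exponentially spaced radii, so that scale changes become shifts along this axis.
    radii = r_min * (r_max / r_min) ** (np.arange(n_rho) / (n_rho - 1))
    # Half a turn of angles is enough: the magnitude spectrum of a real image is symmetric.
    thetas = np.pi * np.arange(n_theta) / n_theta - np.pi / 2
    rows = np.rint(cy + radii[None, :] * np.sin(thetas[:, None])).astype(int)
    cols = np.rint(cx + radii[None, :] * np.cos(thetas[:, None])).astype(int)
    return mag[rows, cols]                              # shape (n_theta, n_rho)

rng = np.random.default_rng(1)
image = rng.random((64, 64))                            # hypothetical stand-in for a windowed face
sig = log_polar_signature(image)
sig_shifted = log_polar_signature(np.roll(image, (5, -3), axis=(0, 1)))
```

A circular shift of the image changes only the phase of its DFT, so `sig` and `sig_shifted` coincide; this is the translation invariance of the magnitude spectrum referred to above.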

The spectral signatures from the gray scale images are localized (windowed) log-polar sampled
Fourier magnitude spectra. The continuum of slices of the V/ISS provides all facial poses, and
the band-passed Fourier magnitude spectrum provides 2-D translation invariant (in the imaging
plane) signatures. Log-polar sampling of the 2-D Fourier spectrum allows for scale invariance
(translation normal to the imaging plane) and rotation invariance (within the imaging plane).
This is because a scaled image manifests itself in a Fourier spectrum scaled inversely
proportionally, and a rotated image has a rotated spectrum. Thus scaled and rotated images have
signatures which are only linearly shifted in the log-polar sampled frequency domain.

Table 1: Pose estimation errors for faces with known pose. Note these are the averaged absolute
errors.

Azimuth Error    Elevation Error    Rotation Error    Scale Std. Dev.
4.05°            5.63°              2.68°             0.0856

The pose of a given image is determined by correlating the intensity image signature with the
V/ISS in the 4-D pose space. The matching process is based on indexing through the sampled
V/ISS slices and maximizing the correlation coefficient over all 4 pose parameters. The
correlation is performed on the signature gradient, which reduces the dependence on actual
spectral magnitudes and considers only the shape of the spectral envelope. The results take the
form of scale and rotation estimates along with a matching score from 0 to 1.
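The matching step can be illustrated for a single pose parameter. The sketch below searches circular shifts along the rotation axis of a log-polar signature and scores each by a normalized correlation coefficient computed on a finite-difference gradient. The choice of gradient (differences along the log-radius axis), the circular treatment of the rotation axis, and the synthetic signatures are all assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np

def correlation_coefficient(a, b):
    """Normalized correlation coefficient between two arrays of equal shape."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

def best_rotation_shift(model_sig, test_sig):
    """Search circular shifts along the theta axis (axis 0) for the best gradient match."""
    model_grad = np.diff(model_sig, axis=1)             # gradient along the log-radius axis
    scores = []
    for shift in range(model_sig.shape[0]):
        test_grad = np.diff(np.roll(test_sig, shift, axis=0), axis=1)
        scores.append(correlation_coefficient(model_grad, test_grad))
    best = int(np.argmax(scores))
    return best, scores[best]

rng = np.random.default_rng(2)
model_sig = rng.random((32, 16))                        # hypothetical model slice signature
test_sig = 3.0 * np.roll(model_sig, -5, axis=0)         # rotated copy with rescaled magnitudes
shift, score = best_rotation_shift(model_sig, test_sig)
```

Because the coefficient is normalized, the overall factor of 3 in spectral magnitude does not affect the score, and the best shift recovers the rotation offset. A full implementation would search azimuth, elevation and scale in the same way.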

Similar approaches have been used very successfully to match Affine Invariant Spectral Signatures
(AISS) [1] [3] [7] [6] [5]. References [1] and [3] include detailed noise analysis with white
and colored noise, which shows robustness to noise levels of up to 0 dB SNR.


2.4  Face  Pose  Estimation 

To verify the accuracy of the pose estimation procedure, the method is first tested on images
generated from the 3-D face model. 20 images of the face in Fig. 19 are generated using random
viewpoints and scales from uniform distributions. The azimuth and elevation are in the range
[-30°, 30°], the rotation angle is in the range [-45°, 45°] and the scale in the range [0.5, 1.5].
These are indexed in the V/ISS pose space. The results are summarized in Table 1. An example
of the correlation peak for the estimated pose in azimuth and elevation is shown in Fig. 22 for
the test image in Fig. 21. The corresponding reconstructed face from the V/ISS slice is shown
in Fig. 23.

In addition, Fig. 24 shows the results of pose estimation for face images of the subject with
unknown pose and illumination.

Figure 21: A test image with pose parameters (14°, -8°, 41°, 1.4).

Figure 22: The correlation maximum in the azimuth-elevation dimensions of the pose space. The
peak is quite discriminative, as seen by its relative brightness.

Figure 23: The reconstructed image from the slice which maximizes the correlation. Pose
parameters (10°, -10°, 40°, 1.414).


Table 2: Face recognition using the ORL database. Recognition rates are given for 5, 6, 7 and 8
images as V/ISS slices.

Number of Slices    5        6        7        8
Recognition Rate    92.5%    95.6%    96.6%    100%

2.5  Face  Recognition  Results 

In  this  section,  we  describe  experiments  on  face  recognition  based  on  the  V/ISS  model.  The 
ORL  database  [31]  is  used.  The  ORL  database  consists  of  10  images  of  each  of  40  people  taken 
in  varying  pose  and  illumination.  Thus  there  are  a  total  of  400  images  in  the  database. 

We select a number of these images, varying from 5 to 8, as model images, and the remaining
images form the test set. The model images are windowed with an ellipse with a Gaussian
fall-off. The recognition is robust to the window parameters selected, provided the value of
$\sigma$ for the Gaussian fall-off is relatively large. The images are 112 x 92 pixels. The
window parameters chosen were 30 pixels for the longer elliptical axis, aligned vertically,
22 pixels for the shorter axis, aligned horizontally, and $\sigma$ = 15 pixels. Each window is
centered at (60, 46). Using fixed parameters allows for faster processing than manually fitting
windows to each face image. Thus, the same elliptical Gaussian window was used on all model and
test images, even though its axes do not align accurately with the axes of all the faces. The
windowed images are transformed to the Fourier domain and then sampled in a log-polar format, so
that they correspond to slices in a 4-D V/ISS pose space. The test images are then indexed into
the dataset of slices for each person.

Figure 24: Using the V/ISS model, the pose of the face in the above images is estimated
and the faces are recognized. The estimated poses are given in terms of the 4-tuple of
azimuth $\alpha$, elevation $\epsilon$, relative swing (rotation) $\theta$, and relative scale
$r_0 a^{\rho}$. The results are A: (+15°, +20°, +8°, 1.6818), B: (+10°, -10°, +4°, 1.0),
C: (+0°, -5°, -4°, 1.834), D: (+15°, +25°, +4°, 1.0), E: (+20°, -5°, 0°, 1.414) and
F: (+15°, +0°, -4°, 1.6818).
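The fixed elliptical window with Gaussian fall-off used on the 112 x 92 ORL images can be sketched as below. The text gives the parameters but not the exact functional form, so several choices here are assumptions: the 30 and 22 pixel values are treated as semi-axes, (60, 46) is read as (row, column), the window equals 1 inside the ellipse, and outside it decays as a Gaussian of the approximate distance beyond the boundary. The function name is hypothetical.

```python
import numpy as np

def elliptical_gaussian_window(shape=(112, 92), center=(60, 46),
                               semi_axes=(30.0, 22.0), sigma=15.0):
    """Window equal to 1 inside the ellipse, with a Gaussian fall-off outside it."""
    rows, cols = np.indices(shape)
    dy, dx = rows - center[0], cols - center[1]
    # Normalized elliptical radius: exactly 1 on the ellipse boundary.
    rad = np.sqrt((dy / semi_axes[0]) ** 2 + (dx / semi_axes[1]) ** 2)
    # Distance in pixels beyond the boundary, measured along the ray from the center.
    dist = np.hypot(dy, dx)
    outside = np.where(rad > 1.0, dist * (1.0 - 1.0 / np.maximum(rad, 1.0)), 0.0)
    return np.exp(-outside ** 2 / (2.0 * sigma ** 2))

window = elliptical_gaussian_window()
# A face image of the same shape would be masked as: windowed = image * window
```

Using one such window for every image, rather than fitting a window per face, is the shortcut described in the text; a relatively large `sigma` keeps the fall-off gentle, so that small misalignments change the signature only slightly.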

The recognition rates using 5, 6, 7 and 8 model images are summarized in Table 2. As can be
seen, a recognition rate of 92.5% is achieved when using 5 slices. This increases to 100% when
using 8 slices in the model. A few of the test images that are recognized are shown in Fig. 25.
Computationally, indexing each face takes about 320 seconds when using 5 slices and up to about
512 seconds when using 8 slices. The experiments are performed on a 200 MHz Pentium Pro
running Linux.

Figure 25: Shown are images of 25 faces from the set of test images which are used for the face
recognition task using our matching scheme.
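The recognition rates in Table 2 can be related to approximate image counts with a short calculation. It assumes that the same number of model images is taken from each of the 40 subjects, so that the test set is the remaining images of every subject; the implied counts are rounded to the nearest image and are estimates, not figures reported in the text.

```python
images_per_person = 10
people = 40
reported_rates = {5: 0.925, 6: 0.956, 7: 0.966, 8: 1.000}   # recognition rates from Table 2

summary = {}
for n_model, rate in reported_rates.items():
    n_test = (images_per_person - n_model) * people      # remaining images form the test set
    n_correct = round(rate * n_test)                     # nearest whole number of images
    summary[n_model] = (n_test, n_correct, n_test - n_correct)

for n_model, (n_test, n_correct, n_errors) in summary.items():
    print(f"{n_model} model images: {n_test} test images, "
          f"about {n_correct} correct, about {n_errors} errors")
```

Under these assumptions the test set shrinks from 200 to 80 images as the number of model images grows from 5 to 8, so the higher rates are measured on progressively fewer test images.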

2.6  Summary  and  Conclusions 

We present a novel representation technique for 3-D objects unifying both the viewer and model
centered object representation approaches. The unified 3-D frequency-domain representation
(called Volumetric/Iconic Spectral Signatures - V/ISS) encapsulates both the spatial structure of
the object and a continuum of its views in the same data structure. We show that the
frequency-domain representation of an object viewed from any direction can be directly extracted
by employing an extension of the Projection-Slice Theorem. Each view is a planar slice of the
complete 3-D V/ISS representation. Indexing into the V/ISS model is shown to be done efficiently
using a transformation to a 4-D pose space of azimuth, elevation, swing (in-plane image rotation)
and scale. The actual matching is done by correlation techniques.

The application of the V/ISS representation is demonstrated for pose-invariant face recognition.
Pose estimation and recognition experiments are carried out using a V/ISS model constructed
from range data of a person and using gray level images to index into the model. The pose
estimation errors are quite low, at about 4.05° in azimuth, 5.63° in elevation, 2.68° in rotation
and 0.0856 standard deviation in scale estimation. The standard deviation in scale is taken for
the ratio of estimated size to true size. Thus it represents the standard deviation assuming a
scale of 1.0. Face recognition experiments are also carried out on a large database of 40
subjects with face images in varying pose and illumination. A varying number of model images,
between 5 and 8, is used. Experimental results indicate recognition rates of 92.5% using 5 model
images, rising to 100% using 8 model images. This compares well with [37], which reported
recognition rates of 87% and 95% using the same database with 5 training images. The eigenfaces
approach [33] was able to achieve a 90% recognition rate [29] on this database. Our result is
also comparable to the recognition rate of 96.2% reported in [29], again using 5 training images
per person from the same database. These are the highest reported recognition rates for the ORL
database in the literature. The V/ISS model holds promise as a robust and reliable representation
approach that inherits the merits of both the viewer and object centered approaches. We plan
future investigations in using the V/ISS model for robust methods in generic object recognition.